How to grab alternating child tags in Python BeautifulSoup
I am trying to get a series of data from alternating tags in an HTML page. The HTML looks like this:

    <div>
      <h3>title</h3>
      <div>text</div>
      <h3>title</h3>
      <div>text</div>
      ...
    </div>

Since I can't grab each h3/div pair with a "for each pair in div" loop, how do I grab them efficiently?
Best answer
Find all headers, and grab the next sibling from there:

    for header in soup.select('div h3'):
        next_div = header.find_next_sibling('div')

element.find_next_sibling() returns an element, or None if no such sibling can be found.
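Because the lookup can return None, it may be worth guarding against headers that have no following div. The snippet below is only a minimal sketch of that idea: the sample markup, the `pairs` list, and the choice to record None for a missing div are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the last <h3> has no <div> after it.
html = """
<div>
  <h3>First header</h3>
  <div>First text</div>
  <h3>Orphan header</h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for header in soup.select("div h3"):
    next_div = header.find_next_sibling("div")
    # find_next_sibling() returns None when no matching sibling follows,
    # so check before reading text from the result.
    text = next_div.get_text(strip=True) if next_div is not None else None
    pairs.append((header.get_text(strip=True), text))

print(pairs)
```

Whether to skip such headers, record None, or raise an error depends on how the data is used afterwards.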
Demo:

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup('''\
    ... <div>
    ...     <h3>First header</h3>
    ...     <div>First div to go with a header</div>
    ...     <h3>Second header</h3>
    ...     <div>Second div to go with a header</div>
    ... </div>
    ... ''', 'html.parser')
    >>> for header in soup.select('div h3'):
    ...     next_div = header.find_next_sibling('div')
    ...     print(header.text, next_div.text)
    ...
    First header First div to go with a header
    Second header Second div to go with a header
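One caveat with this approach: find_next_sibling('div') returns the next sibling that is a div, which is not necessarily the element directly after the header. If a header is missing its div, it gets paired with the div that belongs to a later header. The following is a hedged sketch of a stricter pairing. The helper `next_tag_sibling` and the sample markup are hypothetical names and data introduced here for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup: the first <h3> has no <div> of its own.
html = """
<div>
  <h3>Header without text</h3>
  <h3>Second header</h3>
  <div>Second text</div>
</div>
"""


def next_tag_sibling(tag):
    """Return the first following sibling that is a tag, skipping text nodes."""
    for sibling in tag.next_siblings:
        if isinstance(sibling, Tag):
            return sibling
    return None


soup = BeautifulSoup(html, "html.parser")

loose = []   # pairs found with find_next_sibling('div')
strict = []  # pairs where the div must directly follow the header
for header in soup.select("div h3"):
    title = header.get_text(strip=True)

    loose_div = header.find_next_sibling("div")
    loose.append((title, loose_div.get_text(strip=True) if loose_div is not None else None))

    candidate = next_tag_sibling(header)
    is_div = candidate is not None and candidate.name == "div"
    strict.append((title, candidate.get_text(strip=True) if is_div else None))

print(loose)
print(strict)
```

For markup that strictly alternates h3 and div, as in the question, both variants give the same pairs, so the simpler loop is enough.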