Could someone please explain how filtering works with Beautiful Soup? I've got the HTML below that I am trying to filter specific data from, but I can't seem to access it. I've tried various approaches, from gathering all the class="g" elements to grabbing just the items of interest in that specific div, but I just get None returns or nothing printed.
Each page has a <div class="srg"> div with multiple <div class="g"> divs; the data I am looking to use is the data within <div class="g">. Each of these has multiple divs, but I'm only interested in the <cite> and <span class="st"> data. I am struggling to understand how the filtering works; any help would be appreciated.
I have attempted stepping through the divs and grabbing the relevant fields:
```
soup = BeautifulSoup(response.text)
main = soup.find('div', {'class': 'srg'})
result = main.find('div', {'class': 'g'})
data = result.find('div', {'class': 's'})
data2 = data.find('div')
for item in data2:
    site = item.find('cite')
    comment = item.find('span', {'class': 'st'})
    print site
    print comment
```

I have also attempted stepping into the initial div and finding all:
```
soup = BeautifulSoup(response.text)
s = soup.findAll('div', {'class': 's'})
for result in s:
    site = result.find('cite')
    comment = result.find('span', {'class': 'st'})
    print site
    print comment
```

Test data
```
<div class="srg">
 <div class="g">
 <div class="g">
 <div class="g">
 <div class="g">
  <!--m-->
  <div class="rc" data="30">
   <div class="s">
    <div>
     <div class="f kv _SWb" style="white-space:nowrap">
      <cite class="_Rm">www.url.stuff/here</cite>
      <span class="st">www.url. Some info on url etc etc </span>
     </div>
    </div>
   </div>
   <!--n-->
  </div>
 <div class="g">
 <div class="g">
 <div class="g">
</div>
```

Update
After Alecxe's solution I took another stab at getting it right but still wasn't getting anything printed. So I decided to take another look at the soup, and it looks different. I was previously looking at the response.text from requests. I can only think that BeautifulSoup modifies the response.text, or I somehow got the sample completely wrong the first time (not sure how). However, below is a new sample based on what I see when printing the soup, and below that my attempt to get to the element data I am after.
```
<li class="g">
 <h3 class="r">
  <a href="/url?q=url">context</a>
 </h3>
 <div class="s">
  <div class="kv" style="margin-bottom:2px">
   <cite>www.url/index.html</cite>    #Data I am looking to grab
   <div class="_nBb">
    <div style="display:inline"snipped">
     <span class="_O0"></span>
    </div>
    <div style="display:none" class="am-dropdown-menu" role="menu" tabindex="-1">
     <ul>
      <li class="_Ykb">
       <a class="_Zkb" href="/url?/search">Cached</a>
      </li>
     </ul>
    </div>
   </div>
  </div>
  <span class="st">Details about URI </span>    #Data I am looking to grab
```

Update attempt
I have tried taking Alecxe's approach with no success so far; am I going down the right road with this?
```
soup = BeautifulSoup(response.text)
for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next_sibling("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
```

Answer
You don't have to deal with the hierarchy manually - let BeautifulSoup worry about it. Your second approach is close to what you should really be trying to do, but it would fail once you get a div with class="s" that has no cite element inside.
Instead, you need to let BeautifulSoup know that you are interested in specific elements containing specific elements. Let's ask for cite elements located inside div elements with class="g", themselves inside the div element with class="srg" - the div.srg div.g cite CSS selector finds exactly what we are asking for:
```
for cite in soup.select("div.srg div.g cite"):
    span = cite.find_next_sibling("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
```

Then, once the cite is located, we are "going sideways" and grabbing the next span sibling element with class="st". Though, yes, here we are assuming it exists.
For the provided sample data, it prints:
```
www.url.stuff/here
www.url. Some info on url etc etc
```
The updated code for the updated input data:
```
for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
```
Also, make sure you are using BeautifulSoup version 4:
```
pip install --upgrade beautifulsoup4
```

and the import statement should be:
```
from bs4 import BeautifulSoup
```
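A quick way to confirm which version you actually have installed (bs4 exposes a version string):

```python
import bs4

# Any version string starting with "4." means you are on BeautifulSoup 4
print(bs4.__version__)
```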