Could someone please explain how filtering works with Beautiful Soup? I've got the HTML below that I am trying to filter specific data from, but I can't seem to access it. I've tried various approaches, from gathering all the class="g" elements to grabbing just the items of interest in that specific div, but I just get None returns or nothing printed.
Each page has a <div class="srg"> div with multiple <div class="g"> divs; the data I am looking to use is the data within <div class="g">. Each of these has multiple divs, but I'm only interested in the <cite> and <span class="st"> data. I am struggling to understand how the filtering works; any help would be appreciated.
I have attempted stepping through the divs and grabbing the relevant fields:
```
soup = BeautifulSoup(response.text)
main = soup.find('div', {'class': 'srg'})
result = main.find('div', {'class': 'g'})
data = result.find('div', {'class': 's'})
data2 = data.find('div')
for item in data2:
    site = item.find('cite')
    comment = item.find('span', {'class': 'st'})
    print site
    print comment
```

I have also attempted stepping into the initial div and finding all:
```
soup = BeautifulSoup(response.text)
s = soup.findAll('div', {'class': 's'})
for result in s:
    site = result.find('cite')
    comment = result.find('span', {'class': 'st'})
    print site
    print comment
```

Test data
```
<div class="srg">
 <div class="g">
 <div class="g">
 <div class="g">
 <div class="g">
  <!--m-->
  <div class="rc" data="30">
   <div class="s">
    <div>
     <div class="f kv _SWb" style="white-space:nowrap">
      <cite class="_Rm">www.url.stuff/here</cite>
      <span class="st">www.url. Some info on url etc etc </span>
     </div>
    </div>
   </div>
   <!--n-->
  </div>
 <div class="g">
 <div class="g">
 <div class="g">
</div>
```

Update
After Alecxe's solution I took another stab at getting it right but still wasn't getting anything printed. So I decided to take another look at the soup, and it looks different. I was previously looking at the response.text from requests. I can only think that BeautifulSoup modifies the response.text, or I somehow got the sample completely wrong the first time (not sure how). However, below is a new sample based on what I see when printing the soup, and below that my attempt to get to the element data I am after.
```
<li class="g">
 <h3 class="r">
  <a href="/url?q=url">context</a>
 </h3>
 <div class="s">
  <div class="kv" style="margin-bottom:2px">
   <cite>www.url/index.html</cite>    #Data I am looking to grab
   <div class="_nBb">
    <div style="display:inline"snipped">
     <span class="_O0"></span>
    </div>
    <div style="display:none" class="am-dropdown-menu" role="menu" tabindex="-1">
     <ul>
      <li class="_Ykb">
       <a class="_Zkb" href="/url?/search">Cached</a>
      </li>
     </ul>
    </div>
   </div>
  </div>
  <span class="st">Details about URI </span>    #Data I am looking to grab
```

Update attempt
I have tried taking Alecxe's approach with no success so far; am I going down the right road with this?
```
soup = BeautifulSoup(response.text)
for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next_sibling("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
```

Answer
You don't have to deal with the hierarchy manually - let BeautifulSoup worry about it. Your second approach is close to what you should really be trying to do, but it would fail once you get a div with class="s" that has no cite element inside.
Instead, you need to let BeautifulSoup know that you are interested in specific elements containing specific elements. Let's ask for cite elements located inside div elements with class="g", themselves inside the div element with class="srg" - the div.srg div.g cite CSS selector finds exactly what we are asking for:
```
for cite in soup.select("div.srg div.g cite"):
    span = cite.find_next_sibling("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
```

Then, once the cite is located, we are "going sideways" and grabbing the next span sibling element with class="st". Though, yes, here we are assuming it exists.
For the provided sample data, it prints:
```
www.url.stuff/here
www.url. Some info on url etc etc
```
The updated code for the updated input data:
```
for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
```
Also, make sure you are using BeautifulSoup version 4:
```
pip install --upgrade beautifulsoup4
```

and the import statement should be:
```
from bs4 import BeautifulSoup
```
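A quick way to confirm which version you actually have installed (bs4 exposes a version string):

```python
import bs4

# Any version string starting with "4." means you are on BeautifulSoup 4
print(bs4.__version__)
```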