BeautifulSoup Cannot FindAll

Updated: 2024-10-25 22:32:31

I'm trying to scrape nature.com to perform some analysis on journal articles. When I execute the following:

    import requests
    from bs4 import BeautifulSoup
    import re

    query = "http://www.nature.com/search?journal=nature&order=date_desc"

    for page in range(1, 10):
        req = requests.get(query + "&page=" + str(page))
        soup = BeautifulSoup(req.text)
        cards = soup.findAll("li", "mb20 card cleared")
        matches = re.findall('mb20 card cleared', req.text)
        print(len(cards), len(matches))

I expect BeautifulSoup to print "25" (the number of search results) once for each of the 9 pages requested, but it doesn't. Instead, it prints:

    14, 25
    12, 25
    25, 25
    15, 25
    15, 25
    17, 25
    17, 25
    15, 25
    14, 25

Looking at the HTML source shows that each page should return 25 results, but BeautifulSoup seems to get confused here and I can't figure out why.

Update 1: In case it matters, I'm running Anaconda Python 2.7.10 and bs4 version 4.3.1 on Mac OS X Mavericks.

Update 2: I added a regex to show that req.text does indeed contain what I'm looking for, but BeautifulSoup is not finding it.

Update 3: When I run this simple script multiple times, I sometimes get a "Segmentation fault: 11". I'm not sure why.

Best answer

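BeautifulSoup can use several different parsers under the hood, and they do not always build the same tree from the same HTML, particularly when the markup is not well formed. That would explain why findAll returns fewer elements than the raw text contains.

If you don't specify a parser explicitly, BeautifulSoup chooses one based on its own ranking: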

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.
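To see which parser was actually picked on your machine, something like the following minimal sketch may help. The helper name chosen_parser is made up for illustration, and it assumes the tree builder exposes its name as soup.builder.NAME, which is how bs4's bundled builders are written but is not a prominently documented API.

```python
import warnings

from bs4 import BeautifulSoup


def chosen_parser(html):
    """Return the name of the tree builder bs4 picks when none is requested.

    Hypothetical helper, not part of bs4.
    """
    with warnings.catch_warnings():
        # Newer bs4 releases warn when the parser is guessed; silence that here.
        warnings.simplefilter("ignore")
        soup = BeautifulSoup(html)
    # Assumption: the builder classes shipped with bs4 carry a NAME attribute.
    return soup.builder.NAME


# Prints the name of whichever parser is ranked best among those installed.
print(chosen_parser("<p>hello</p>"))
```

If this prints a different name on two machines, the same script can produce different results on each of them.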

Specify the parser explicitly, using one of the following:

    soup = BeautifulSoup(data, 'html5lib')
    soup = BeautifulSoup(data, 'html.parser')
    soup = BeautifulSoup(data, 'lxml')
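To check whether the parser is really the cause, you could count the matches with every parser you have installed and compare. This is only a sketch: SAMPLE_HTML is made-up markup standing in for one page of search results, and count_cards is a hypothetical helper. Parsers that are not installed are skipped by catching FeatureNotFound, which bs4 raises when it cannot find the requested parser.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, well-formed sample standing in for a real results page.
SAMPLE_HTML = """
<ul>
  <li class="mb20 card cleared"><a href="/a1">First</a></li>
  <li class="mb20 card cleared"><a href="/a2">Second</a></li>
  <li class="mb20 card cleared"><a href="/a3">Third</a></li>
</ul>
"""


def count_cards(html, parsers=("lxml", "html5lib", "html.parser")):
    """Count the result cards found by each installed parser."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This parser is not installed; skip it.
            continue
        # Match <li> elements whose class attribute is exactly this string.
        counts[name] = len(soup.find_all("li", class_="mb20 card cleared"))
    return counts


print(count_cards(SAMPLE_HTML))
```

On clean markup like this sample, every parser should report the same count. Passing req.text from the question instead of SAMPLE_HTML would show which parser, if any, recovers all 25 cards on the real pages.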

Published: 2023-04-28 03:55:00
Link: https://www.elefans.com/category/jswz/34/1329830.html