Issues with BeautifulSoup parsing

Updated: 2024-10-26 21:24:51

I am trying to parse an HTML page with BeautifulSoup, but it appears that BeautifulSoup doesn't like the HTML of that page at all. When I run the code below, prettify() returns only the script block of the page (see below). Does anybody have an idea why this happens?

    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
    html = "".join(urllib2.urlopen(url).readlines())
    print "-- HTML ------------------------------------------"
    print html
    print "-- BeautifulSoup ---------------------------------"
    print BeautifulSoup(html).prettify()

This is the output produced by BeautifulSoup.

    -- BeautifulSoup ---------------------------------
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <script language="JavaScript">
    <!--
    function highlight(img) {
      document[img].src = "/marketing/sony/images/en/" + img + "_on.gif";
    }
    function unhighlight(img) {
      document[img].src = "/marketing/sony/images/en/" + img + "_off.gif";
    }
    //-->
    </script>

Thanks!

UPDATE: I am using the following version, which appears to be the latest.

    __author__ = "Leonard Richardson (leonardr@segfault.org)"
    __version__ = "3.1.0.1"
    __copyright__ = "Copyright (c) 2004-2009 Leonard Richardson"
    __license__ = "New-style BSD"

Best answer

Try version 3.0.7a, as Łukasz suggested. BeautifulSoup 3.1 was designed to be compatible with Python 3.0, so its parser had to change from SGMLParser to HTMLParser, which appears to be less tolerant of malformed HTML.

From the changelog for BeautifulSoup 3.1:

"Beautiful Soup is now based on HTMLParser rather than SGMLParser, which is gone in Python 3. There's some bad HTML that SGMLParser handled but HTMLParser doesn't"
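
To make the version distinction concrete, here is a minimal sketch of a check that flags a version string from the 3.1 series, which is the series the changelog describes as HTMLParser-based. The helper name uses_htmlparser is hypothetical and not part of BeautifulSoup; the rule it encodes (3.1.x is affected, 3.0.7a is not) is taken only from the answer above.

```python
import re


def uses_htmlparser(version):
    """Return True if a BeautifulSoup version string is in the 3.1 series.

    Hypothetical helper: it only encodes the rule stated in the answer
    (3.1.x is HTMLParser-based, 3.0.7a is SGMLParser-based).
    """
    # Read the leading "major.minor" part; suffixes such as "7a" are ignored.
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % (version,))
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) == (3, 1)


if __name__ == "__main__":
    # The two versions mentioned in the question and the answer.
    for version in ("3.1.0.1", "3.0.7a"):
        if uses_htmlparser(version):
            label = "3.1 series (HTMLParser-based)"
        else:
            label = "not in the 3.1 series"
        print("%s: %s" % (version, label))
```

In the asker's setup the string to test would presumably be the module's __version__ value shown in the question, "3.1.0.1", for which the check returns True.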

Published: 2023-07-08 00:58:00