I am trying to parse an HTML page with BeautifulSoup, but it appears that BeautifulSoup doesn't like the HTML of that page at all. When I run the code below, the prettify() method returns only the script block of the page (see below). Does anybody have an idea why this happens?
    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
    html = "".join(urllib2.urlopen(url).readlines())
    print "-- HTML ------------------------------------------"
    print html
    print "-- BeautifulSoup ---------------------------------"
    print BeautifulSoup(html).prettify()

This is the output produced by BeautifulSoup.
    -- BeautifulSoup ---------------------------------
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <script language="JavaScript">
    <!--
    function highlight(img) {
      document[img].src = "/marketing/sony/images/en/" + img + "_on.gif";
    }
    function unhighlight(img) {
      document[img].src = "/marketing/sony/images/en/" + img + "_off.gif";
    }
    //-->
    </script>

Thanks!
UPDATE: I am using the following version, which appears to be the latest.
    __author__ = "Leonard Richardson (leonardr@segfault.org)"
    __version__ = "3.1.0.1"
    __copyright__ = "Copyright (c) 2004-2009 Leonard Richardson"
    __license__ = "New-style BSD"

Best answer
Try version 3.0.7a, as Łukasz suggested. BeautifulSoup 3.1 was designed to be compatible with Python 3.0, so its parser had to change from SGMLParser to HTMLParser, which seems to be less tolerant of bad HTML.
From the changelog for BeautifulSoup 3.1:
"Beautiful Soup is now based on HTMLParser rather than SGMLParser, which is gone in Python 3. There's some bad HTML that SGMLParser handled but HTMLParser doesn't"
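If a script has to run on machines where either release might be installed, one option is to check the version string before parsing. The following is only a minimal sketch: the helper name `is_htmlparser_based` is hypothetical (it is not part of BeautifulSoup), and it assumes that only the 3.1 series counts as HTMLParser-based, which is all the changelog quoted above establishes.

```python
import re


def is_htmlparser_based(version):
    """Return True if the version string belongs to the BeautifulSoup 3.1 series.

    Hypothetical helper, not part of BeautifulSoup. Assumption: only 3.1.x
    is treated as HTMLParser-based, following the changelog quoted above.
    """
    # Read the leading "major.minor" pair; suffixes such as "a" in "3.0.7a"
    # come after it and are ignored.
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) == (3, 1)


if __name__ == "__main__":
    # The two versions mentioned in the question and the answer.
    for v in ("3.0.7a", "3.1.0.1"):
        print("%s -> %s" % (v, is_htmlparser_based(v)))
```

In a real script the argument would be the `__version__` value of the installed BeautifulSoup module, as shown in the question's update.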