I'm trying to scrape all the inner HTML from the <p> elements in a web page using BeautifulSoup. There are internal tags, but I don't care, I just want to get the internal text.
For example, for:
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>

How can I extract:

Red
Blue
Yellow
Light green

Neither .string nor .contents[0] does what I need. Nor does .extract(), because I don't want to have to specify the internal tags in advance - I want to deal with any that may occur.
Is there a 'just get the visible HTML' type of method in BeautifulSoup?
----UPDATE------
On advice, trying:
soup = BeautifulSoup(open("test.html"))
p_tags = soup.findAll('p',text=True)
for i, p_tag in enumerate(p_tags):
    print str(i) + p_tag

But that doesn't help - it prints out:

0Red
1
2Blue
3
4Yellow
5
6Light
7green
8

Best answer
Short answer: soup.findAll(text=True)
This has already been answered, here on StackOverflow and in the BeautifulSoup documentation.
UPDATE:
To clarify, a working piece of code:
>>> txt = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""
>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.0.7a'
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> for node in soup.findAll('p'):
    print ''.join(node.findAll(text=True))

Red
Blue
Yellow
Light green
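For readers on Python 3, the same idea can be sketched with the newer bs4 package. This is an assumption on my part: the answer above was written against BeautifulSoup 3.0.7a, and the names used here (the bs4 import, find_all, get_text, and the "html.parser" argument) come from BeautifulSoup 4, not from the original post. Treat it as a minimal sketch rather than the answerer's own code.

```python
# Assumption: BeautifulSoup 4 (the bs4 package), not the 3.0.7a release used above.
from bs4 import BeautifulSoup

html = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

# "html.parser" is Python's built-in parser, so no extra parser package is needed.
soup = BeautifulSoup(html, "html.parser")

# get_text() concatenates every text node under a tag, whatever inner tags it has.
texts = [p.get_text() for p in soup.find_all("p")]

# Same approach as the answer: join all text nodes found under each <p>.
joined = ["".join(p.find_all(string=True)) for p in soup.find_all("p")]

for text in texts:
    print(text)
```

Both lists hold the same four strings. Neither approach needs to know in advance which inner tags occur, which is what the question asked for.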