I have an HTML document of the following format.

    <p> 1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>

I want to extract the content of the paragraph tag, including the content of the italic and bold tags but not the content of the anchor tag. Also, if possible, I would like to ignore the number at the beginning.
The expected output is: Content of the paragraph in italic but not strong.
What is the best way to do it?
Also, the following code snippet raises TypeError: argument of type 'NoneType' is not iterable

    soup = BSoup(page)
    for p in soup.findAll('p'):
        if ' ' in p.string:
            print p

Thanks for the suggestions.
Best answer
Your code fails because tag.string is only set when the tag has exactly one child and that child is a NavigableString. Your <p> has several children (text nodes plus the <i>, <b> and <a> tags), so p.string is None, and the test ' ' in p.string raises the TypeError.
You can achieve what you want by extracting the a tag:

    from BeautifulSoup import BeautifulSoup

    s = """<p> 1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>"""
    soup = BeautifulSoup(s, convertEntities=BeautifulSoup.HTML_ENTITIES)
    for p in soup.findAll('p'):
        # Remove every <a> from the tree so its text is not collected
        for a in p.findAll('a'):
            a.extract()
        # Join all remaining text nodes, including those inside <i> and <b>
        print ''.join(p.findAll(text=True))
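The snippet above is Python 2 with BeautifulSoup 3, and it leaves the leading number and the extra whitespace in place. As a rough sketch of the same idea (collect the paragraph text, skip anything inside an anchor), here is a Python 3 version that uses only the standard library's html.parser instead of BeautifulSoup. The names ParagraphText and paragraph_texts are made up for this illustration, and the regular expressions assume the number looks like "1. " and that a space before punctuation is unwanted; adjust them if your documents differ.

```python
import re
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of each <p>, skipping anything inside <a>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished paragraph strings
        self._parts = None     # text pieces of the <p> being read, else None
        self._a_depth = 0      # greater than 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts = []
        elif tag == "a":
            self._a_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._a_depth:
            self._a_depth -= 1
        elif tag == "p" and self._parts is not None:
            self.paragraphs.append("".join(self._parts))
            self._parts = None

    def handle_data(self, data):
        # Keep text only when inside a <p> and outside every <a>
        if self._parts is not None and not self._a_depth:
            self._parts.append(data)


def paragraph_texts(html):
    """Return the cleaned text of every <p> in the given HTML string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    cleaned = []
    for text in parser.paragraphs:
        text = " ".join(text.split())                 # collapse whitespace runs
        text = re.sub(r"^\d+\.\s*", "", text)         # assumed "1. " style prefix
        text = re.sub(r"\s+([.,;:!?])", r"\1", text)  # no space before punctuation
        cleaned.append(text)
    return cleaned


html = ('<p> 1. Content of the paragraph <i> in italic </i> but not '
        '<b> strong </b> <a href="url">ignore</a>.</p>')
print(paragraph_texts(html))
```

For the sample paragraph this prints a one-element list containing the expected output from the question. Because the text is gathered node by node, the None value of p.string never comes into play.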