Extracting tag content based on content value using BeautifulSoup

Updated: 2024-10-23 03:24:27

I have an HTML document in the following format.

<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>

I want to extract the content of the paragraph tag, including the content of the italic and bold tags but not the content of the anchor tag. If possible, I would also like to ignore the number at the beginning.

The expected output is: Content of the paragraph in italic but not strong.

What is the best way to do it?

Also, the following code snippet raises TypeError: argument of type 'NoneType' is not iterable

    soup = BSoup(page)
    for p in soup.findAll('p'):
        if '&nbsp;&nbsp;&nbsp;' in p.string:
            print p

Thanks for the suggestions.

Best answer

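Your code fails because tag.string is only set if the tag has exactly one child and that child is a NavigableString. Your <p> has several children (text nodes plus the <i>, <b> and <a> tags), so p.string is None, and testing membership in None raises the TypeError.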
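To make the point about tag.string concrete, here is a minimal sketch. It assumes the newer bs4 package (BeautifulSoup 4 on Python 3) rather than the BeautifulSoup 3 import used in this thread, and contains_text is a hypothetical helper name introduced only for illustration.

```python
from bs4 import BeautifulSoup

html = "<p>only text</p><p>text <i>and</i> a child tag</p>"
soup = BeautifulSoup(html, "html.parser")
single, mixed = soup.find_all("p")

# One child that is a plain string: .string returns it.
single_string = single.string

# Several children (text, <i>, text): .string is None,
# so "something in mixed.string" would raise TypeError.
mixed_string = mixed.string


def contains_text(tag, needle):
    """Hypothetical helper: search all nested text instead of tag.string."""
    return needle in tag.get_text()


print(repr(single_string), repr(mixed_string), contains_text(mixed, "child"))
```

Searching get_text() (or guarding with a check that p.string is not None) avoids the error for paragraphs that contain nested tags.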

You can achieve what you want by extracting (removing) the <a> tags from each paragraph and then joining the text that remains:

    # BeautifulSoup 3 on Python 2
    from BeautifulSoup import BeautifulSoup

    s = """<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>"""
    soup = BeautifulSoup(s, convertEntities=BeautifulSoup.HTML_ENTITIES)

    for p in soup.findAll('p'):
        # Remove every <a> tag, together with its text, from the paragraph
        for a in p.findAll('a'):
            a.extract()
        # Join the remaining text nodes, including those inside <i> and <b>
        print ''.join(p.findAll(text=True))
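For readers on Python 3, the same idea could be sketched with the bs4 package roughly as follows. This is not part of the original answer: paragraph_text is a hypothetical helper, and the whitespace, numbering and punctuation clean-up are assumptions added to reach the expected output stated in the question. Note that bs4 converts &nbsp; to the character '\xa0' while parsing, so searching the parsed text for the literal string '&nbsp;' would not match.

```python
import re

from bs4 import BeautifulSoup

SAMPLE = (
    '<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> '
    'but not <b> strong </b> <a href="url">ignore</a>.</p>'
)


def paragraph_text(html):
    """Return the text of each <p>, skipping <a> content (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for p in soup.find_all("p"):
        # Drop anchor tags and their text from the parsed tree.
        for a in p.find_all("a"):
            a.decompose()
        # Collect all remaining text, including text inside <i> and <b>.
        text = p.get_text()
        # Collapse runs of whitespace; str.split() also treats '\xa0' as whitespace.
        text = " ".join(text.split())
        # Assumption: a leading "1." style number should be dropped.
        text = re.sub(r"^\d+\.\s*", "", text)
        # Assumption: remove the space left before punctuation by the deleted <a>.
        text = re.sub(r"\s+([.,;:!?])", r"\1", text)
        results.append(text)
    return results


print(paragraph_text(SAMPLE))
```

The function returns a list with one cleaned string per paragraph, so it can be applied to a whole page as well as to the single-paragraph sample.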

Published: 2023-08-04 15:45:00