Extracting tag content based on content value using BeautifulSoup

Updated: 2024-10-23 03:24:27

I have an HTML document in the following format.

<p>1. Content of the paragraph in italic but not strong <a href="url">ignore</a>.</p>

I want to extract the content of the paragraph tag, including the content of the italic and bold tags but not the content of the anchor tag. Also, if possible, I would like to ignore the number at the beginning.

The expected output is: Content of the paragraph in italic but not strong.

What is the best way to do it?

Also, the following code snippet returns TypeError: argument of type 'NoneType' is not iterable

    soup = BSoup(page)
    for p in soup.findAll('p'):
        if '   ' in p.string:
            print p

Thanks for the suggestions.

Best answer

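Your code fails because tag.string is only set when the tag has exactly one child and that child is a NavigableString. Otherwise tag.string is None, and the test '   ' in p.string raises the TypeError.

You can achieve what you want by extracting the a tags:

    # BeautifulSoup 3 on Python 2
    from BeautifulSoup import BeautifulSoup

    s = """<p>   1. Content of the paragraph in italic but not strong <a href="url">ignore</a>.</p>"""
    soup = BeautifulSoup(s, convertEntities=BeautifulSoup.HTML_ENTITIES)

    for p in soup.findAll('p'):
        # Detach every anchor so its text is not collected below.
        for a in p.findAll('a'):
            a.extract()
        # Join the remaining text nodes, including those inside nested tags.
        print ''.join(p.findAll(text=True))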
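
For readers on Python 3, the same idea can be sketched with the bs4 package (Beautiful Soup 4). This is an illustration rather than the answerer's code: the sample markup, including where the <i> and <b> tags sit, and the helper name paragraph_texts are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Sample markup; the placement of <i> and <b> is assumed for illustration.
html = (
    '<p>1. Content of the <i>paragraph</i> in italic but not '
    '<b>strong</b> <a href="url">ignore</a>.</p>'
)


def paragraph_texts(markup):
    """Return the text of each <p>, leaving out anything inside <a> tags."""
    soup = BeautifulSoup(markup, "html.parser")
    texts = []
    for p in soup.find_all("p"):
        for a in p.find_all("a"):
            a.extract()  # detach the anchor (and its text) from the tree
        texts.append(p.get_text())  # text of <p> plus nested <i>/<b> text
    return texts


# .string is None here because the <p> has more than one child node,
# which is what triggers the TypeError in the question.
string_value = BeautifulSoup(html, "html.parser").find("p").string

result = paragraph_texts(html)
print(string_value)
print(result)
```

Checking p.string for None before using the in operator, or calling get_text() instead, avoids the TypeError from the question.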

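The question also asks about ignoring the number at the start of the paragraph, which the answer does not cover. One possible approach, sketched here with the standard re module, is to strip a leading list marker from the extracted text. The function name strip_leading_number and the accepted marker shapes ("1." or "1)") are assumptions for illustration.

```python
import re


def strip_leading_number(text):
    """Remove leading whitespace and a list marker such as '1. ' or '12) '."""
    return re.sub(r"^\s*\d+[.)]\s*", "", text)


cleaned = strip_leading_number(
    "   1. Content of the paragraph in italic but not strong."
)
print(cleaned)
```

Text that does not begin with a number is returned unchanged, so the helper can be applied to every paragraph.
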

Published: 2023-08-04 15:45:00

Article link: https://www.elefans.com/category/jswz/34/1417667.html