我需要关闭所有打开的标签与正确的嵌套为了清理用户提交的HTML。我一直在寻找一种算法或Python code要做到这一点,但没有发现任何东西,除了一些半生不熟的实现在PHP中,等等。
I need to sanitize HTML submitted by the user by closing any open tags with correct nesting order. I have been looking for an algorithm or Python code to do this but haven't found anything except some half-baked implementations in PHP, etc.
例如,像
<p> <ul> <li>Foo
变为
<p> <ul> <li>Foo</li> </ul> </p>
任何帮助将是AP preciated:)
Any help would be appreciated :)
推荐答案使用BeautifulSoup:
using BeautifulSoup:
from BeautifulSoup import BeautifulSoup html = "<p><ul><li>Foo" soup = BeautifulSoup(html) print soup.prettify()
让你
<p> <ul> <li> Foo </li> </ul> </p>
据我所知,你无法控制将在&lt;立GT;&LT; /李&GT;标签从富单独的行。
As far as I know, you can't control putting the <li></li> tags on separate lines from Foo.
使用整洁:
import tidy html = "<p><ul><li>Foo" print tidy.parseString(html, show_body_only=True)
让你
<ul> <li>Foo</li> </ul>
不幸的是,据我所知,没有办法保持在&lt; P&GT;标签中的例子。整洁跨$ P $其中pts它作为一个空的段落,而不是一个未关闭的,这样算下来
Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing
print tidy.parseString(html, show_body_only=True, drop_empty_paras=False)
出来为
<p></p> <ul> <li>Foo</li> </ul>
最终,当然,在&lt p为H.;在你的例子标签是多余的,所以你可能会被罚款与失去它。
Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.
最后,整洁也可以做缩进:
Finally, Tidy can also do indenting:
print tidy.parseString(html, show_body_only=True, indent=True)
变为
<ul> <li>Foo </li> </ul>
所有这些都起伏变化,但希望其中一人是足够接近。
All of these have their ups and downs, but hopefully one of them is close enough.
更多推荐
如何解决错误的嵌套/未关闭的HTML标记?
发布评论