How can I add consistent whitespace to existing HTML using Python?

I just started working on a website that is full of pages with all their HTML on a single line, which is a real pain to read and work with. I'm looking for a tool (preferably a Python library) that will take HTML input and return the same HTML unchanged, except for adding linebreaks and appropriate indentation. (All tags, markup, and content should be untouched.)

The library doesn't have to handle malformed HTML; I'm passing the HTML through html5lib first, so it will be getting well-formed HTML. However, as mentioned above, I would rather it didn't change any of the actual markup itself; I trust html5lib and would rather let it handle the correctness aspect.

First, does anyone know if this is possible with just html5lib? (Unfortunately, their documentation seems a bit sparse.) If not, what tool would you suggest? I've seen some people recommend HTML Tidy, but I'm not sure if it can be configured to only change whitespace. (Would it do anything except insert whitespace if it were passed well-formed HTML to start with?)

Best answer

Algorithm

1. Parse the HTML into some representation.
2. Serialize the representation back to HTML.
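The two steps above can be sketched with only the standard library. This is a rough illustration, not what html5lib or BeautifulSoup do internally: the names `Node`, `TreeBuilder`, `serialize`, and `prettify` are made up for this example, it assumes well-formed input, and it puts every tag and text run on its own line (which changes rendering for inline elements and `<pre>` content). Start tags are copied verbatim; end tags are written with the lowercased name that `html.parser` reports.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical tree node: raw start-tag text, tag name, children."""

    def __init__(self, start, tag, void=False):
        self.start = start    # start tag (or comment/doctype) as written
        self.tag = tag        # lowercased tag name, used for the end tag
        self.void = void      # True if there is no end tag to emit
        self.children = []    # Node objects or text strings


class TreeBuilder(HTMLParser):
    """Step 1: parse well-formed HTML into a simple tree."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.root = Node("", None)
        self.stack = [self.root]

    def _add_text(self, text):
        children = self.stack[-1].children
        if children and isinstance(children[-1], str):
            children[-1] += text      # merge adjacent text chunks
        else:
            children.append(text)

    def handle_starttag(self, tag, attrs):
        node = Node(self.get_starttag_text(), tag, void=tag in VOID_TAGS)
        self.stack[-1].children.append(node)
        if not node.void:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: keep it as a leaf.
        self.stack[-1].children.append(
            Node(self.get_starttag_text(), tag, void=True))

    def handle_endtag(self, tag):
        if len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        self._add_text(data)

    def handle_entityref(self, name):
        self._add_text("&%s;" % name)

    def handle_charref(self, name):
        self._add_text("&#%s;" % name)

    def handle_comment(self, data):
        self.stack[-1].children.append(Node("<!--%s-->" % data, None, True))

    def handle_decl(self, decl):
        self.stack[-1].children.append(Node("<!%s>" % decl, None, True))


def serialize(node, depth=0, indent=" "):
    """Step 2: write the tree back out, one item per line."""
    lines = []
    for child in node.children:
        if isinstance(child, str):
            text = child.strip()
            if text:                  # drop whitespace-only text
                lines.append(indent * depth + text)
        else:
            lines.append(indent * depth + child.start)
            if not child.void:
                lines.extend(serialize(child, depth + 1, indent))
                lines.append(indent * depth + "</%s>" % child.tag)
    return lines


def prettify(html, indent=" "):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return "\n".join(serialize(builder.root, 0, indent))
```

Calling `prettify(html)` on a single-line page returns the same tags with one level of indentation per nesting depth; pass `indent="  "` for wider indentation.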

Example: the html5lib parser with the BeautifulSoup tree builder

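    #!/usr/bin/env python
    # Assumes an html5lib release that still ships the "beautifulsoup"
    # tree builder (it may not be available in newer releases).
    from html5lib import HTMLParser, treebuilders

    # Step 1: parse the HTML into a BeautifulSoup tree.
    parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

    c = """<HTML><HEAD><TITLE>Title</TITLE></HEAD><BODY>...... </BODY></HTML>"""

    soup = parser.parse(c)

    # Step 2: serialize the tree back to indented HTML.
    print(soup.prettify())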

Output:

    <html>
     <head>
      <title>
       Title
      </title>
     </head>
     <body>
      ......
     </body>
    </html>
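Since the question asks that only whitespace change, it may help to check a pretty-printed result against the original. The following is a minimal sketch using the standard library's `html.parser`. The names `TokenCollector`, `tokens`, and `same_markup` are hypothetical, and the comparison deliberately ignores whitespace differences in text. Note that `html.parser` lowercases tag names, so `<HTML>` and `<html>` compare as equal here, which matches the case change visible in the output above.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Record tags, attributes, comments, and text as a flat token list."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, tuple(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_comment(self, data):
        self.tokens.append(("comment", data))

    def handle_data(self, data):
        # Merge adjacent text chunks so chunking does not affect the result.
        if self.tokens and self.tokens[-1][0] == "text":
            self.tokens[-1] = ("text", self.tokens[-1][1] + data)
        else:
            self.tokens.append(("text", data))


def tokens(html):
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    result = []
    for tok in collector.tokens:
        if tok[0] == "text":
            # Collapse runs of whitespace and drop whitespace-only text.
            text = " ".join(tok[1].split())
            if not text:
                continue
            tok = ("text", text)
        result.append(tok)
    return result


def same_markup(before, after):
    """True if the two documents differ only in whitespace."""
    return tokens(before) == tokens(after)
```

For example, `same_markup(original_html, pretty_html)` should return `True` when the formatter only inserted line breaks and indentation, and `False` if a tag, attribute, or piece of text was altered.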