I have a script that needs to determine the charset of an HTML document before it is passed to lxml.HTML() for parsing. I plan to search the HTML for the meta tag with the charset attribute, and to assume ISO-8859-1 if it can't be found (that's the normal assumed charset for this, right?). However, I'm not sure of the best way to do that. I could try to build an etree with lxml, but I don't want to read the whole file, since I may run into encoding problems. On the other hand, if I don't read the whole file I can't build an etree, because some tags will not have been closed.
Should I just find the meta tag with some fancy string slicing and break out of the loop once it's found or a certain number of lines have been read? Or maybe use a low-level HTML parser, e.g. html.parser? I'm using Python 3, by the way. Thanks.
Best answer
You should first try to extract the encoding from the HTTP headers. If it is not present there, you should parse the document with lxml. This can be tricky, since lxml throws parse errors if the charset does not match. A workaround is to decode and re-encode the data, ignoring unknown characters:
    html_data = html_data.decode("UTF-8", "ignore")
    html_data = html_data.encode("UTF-8", "ignore")

After this, you can parse the data by invoking lxml.HTML() with UTF-8 encoding. Because the markup itself is plain ASCII, it survives this step, so you'll be able to find the encoding declared in the HTML head.
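As an alternative to the decode/encode round trip, the approach raised in the question (a low-level parser over only the start of the file) can be sketched with the standard library's html.parser. This is a minimal sketch, not part of the original answer: the names `sniff_meta_charset` and `_MetaCharsetFinder` and the 4096-byte limit are hypothetical, and it assumes the document uses an ASCII-compatible encoding, so it would not work for UTF-16.

```python
import re
from html.parser import HTMLParser


class _MetaCharsetFinder(HTMLParser):
    """Records the first charset declared in a <meta> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # HTML5 form: <meta charset="utf-8">
            self.charset = attrs["charset"].strip()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: <meta http-equiv="Content-Type" content="text/html; charset=...">
            match = re.search(r"charset\s*=\s*([\w.:-]+)", attrs.get("content") or "", re.I)
            if match:
                self.charset = match.group(1)


def sniff_meta_charset(raw, limit=4096):
    """Return the charset named in the first `limit` bytes of `raw`, or None."""
    finder = _MetaCharsetFinder()
    # latin-1 maps every byte to a character, so this decode cannot fail;
    # the tags and attribute names we care about are ASCII anyway.
    finder.feed(raw[:limit].decode("latin-1"))
    return finder.charset


if __name__ == "__main__":
    print(sniff_meta_charset(b'<html><head><meta charset="utf-8"><title>x'))
```

Because html.parser is event-based, it does not need the tags to be closed, which sidesteps the truncated-tree problem described in the question.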
After finding the encoding, you'll have to re-parse the HTML document with the proper encoding.
Unfortunately, sometimes you won't find a character encoding even in the HTML head. I'd suggest using the chardet module to guess the encoding only after these steps fail.
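The order of checks described above (HTTP header first, then the declaration in the HTML, then chardet) could be sketched roughly as follows. The function names and the `meta_charset` parameter are hypothetical, chardet is a third-party package that is treated as optional here, and the ISO-8859-1 default is simply the assumption from the question.

```python
import codecs
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header value, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def is_known_codec(name):
    """True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def choose_encoding(raw, content_type=None, meta_charset=None, default="ISO-8859-1"):
    """Pick an encoding: HTTP header, then meta declaration, then chardet, then default."""
    for candidate in (charset_from_content_type(content_type), meta_charset):
        if candidate and is_known_codec(candidate):
            return candidate
    try:
        import chardet  # third-party and optional; only used as a last resort
    except ImportError:
        return default
    guess = chardet.detect(raw).get("encoding")
    return guess if guess and is_known_codec(guess) else default
```

Here `meta_charset` is whatever was found in the document itself (or None), and `raw` is the undecoded bytes. Checking each candidate with `codecs.lookup` guards against a page declaring a charset name that Python does not recognise.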