Correct encoding for XML file read from stdin

Updated: 2024-10-25 14:30:14
Consider I need to read an XML file from stdin. How do I get the encoding right?

Currently I simply do

    import sys
    import xml.dom.minidom

    xmlString = sys.stdin.read()
    doc = xml.dom.minidom.parseString(xmlString)

Apparently xmlString is not always decoded properly, which results in misinterpreted characters.

Is there a possibility to fix this or do I have to live with what I get from stdin "as is"?

Edit: It's safe to assume that the file provided via stdin is an SVG file with a proper XML declaration, e.g. consider

    <?xml version="1.0" encoding="UTF-8"?>
    <svg xmlns="http://www.w3.org/2000/svg">
      <desc>ú</desc>
    </svg>

That means the encoding attribute can be used to detect the encoding (but obviously I have to read at least the first line for that), and afterwards I would have to somehow adjust reading from stdin to use the detected encoding.

Best answer

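If the file's encoding may differ each time and is completely unknown, you can use the chardet library to guess it. Note that chardet relies on statistics to find the best match, so it is not perfect.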
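
As a rough sketch of that idea: chardet is a third-party package, and its `chardet.detect` function takes raw bytes and returns a dict with an `"encoding"` entry. The helper name `guess_encoding` and the UTF-8 fallback are assumptions made for illustration only, not part of the original answer.

```python
def guess_encoding(raw, default="utf-8"):
    """Guess the encoding of raw bytes with chardet (hypothetical helper).

    Falls back to `default` (an assumption of this sketch) when chardet
    is not installed or cannot come up with a guess.
    """
    try:
        import chardet  # third-party: pip install chardet
    except ImportError:
        return default
    result = chardet.detect(raw)  # dict with "encoding" and "confidence"
    return result["encoding"] or default


sample = b'<?xml version="1.0"?><svg xmlns="http://www.w3.org/2000/svg"/>'
guessed = guess_encoding(sample)
```

In a script, the bytes would come from `sys.stdin.buffer.read()`, and the result could be passed to `bytes.decode`. Since the guess is statistical, short inputs in particular may be misdetected.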

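If you know the encoding of the data, there are two options: set the PYTHONIOENCODING environment variable, or decode the bytes yourself with str.decode (bytes.decode in Python 3).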
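
A minimal sketch of the explicit-decoding option, assuming Python 3 and an input known to be ISO-8859-1. The helper `read_text` is a hypothetical name, and an in-memory `io.BytesIO` stands in for stdin so the example runs on its own.

```python
import io


def read_text(binary_stream, encoding):
    """Read all bytes from a binary stream and decode them explicitly."""
    return binary_stream.read().decode(encoding)


# Stand-in for sys.stdin.buffer: "<desc>ú</desc>" stored as ISO-8859-1 bytes.
raw = "<desc>\u00fa</desc>".encode("iso-8859-1")
text = read_text(io.BytesIO(raw), "iso-8859-1")
```

In a real script the call would be `read_text(sys.stdin.buffer, "iso-8859-1")`. The alternative is to set `PYTHONIOENCODING=iso-8859-1` in the environment before starting the interpreter, in which case a plain `sys.stdin.read()` decodes with that encoding.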

The solution turned out to be very simple in this case. With

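    import sys
    import xml.dom.minidom

    try:
        stdin_bytes = sys.stdin.buffer  # Python 3: underlying binary stream
    except AttributeError:
        stdin_bytes = sys.stdin  # Python 2: stdin already yields bytes
    xmlString = stdin_bytes.read()
    doc = xml.dom.minidom.parseString(xmlString)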

stdin is read as a binary stream (i.e. not decoded). In my particular case the XML parser handles the decoding on its own just fine, so no further effort on my side is needed.
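
To illustrate why passing undecoded bytes can be enough, here is a small sketch, assuming Python 3, in which the SVG from the question is re-encoded as ISO-8859-1 (an assumption chosen for the demonstration) and handed to the parser as bytes. The parser picks up the encoding from the XML declaration.

```python
import xml.dom.minidom

# The SVG from the question, declared and encoded as ISO-8859-1.
raw = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<svg xmlns="http://www.w3.org/2000/svg"><desc>\u00fa</desc></svg>'
).encode("iso-8859-1")

# No manual decoding: parseString receives bytes and honors the declaration.
doc = xml.dom.minidom.parseString(raw)
desc_text = doc.getElementsByTagName("desc")[0].firstChild.data
```

Here `desc_text` is the single character "ú", even though the input stored it as the lone byte 0xFA rather than as UTF-8.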

Note that Python 3 opens stdin in text mode (decoded) by default, and the encoding it picks is often not the one the file actually uses. Therefore the buffer attribute is needed to access the underlying binary stream. The exception handling is needed because in Python 2 sys.stdin already returns undecoded bytes and has no buffer attribute.
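
The kind of misinterpretation that a mismatched text-mode decode causes can be reproduced in a few lines. This sketch assumes the file is UTF-8 while the stream is decoded as ISO-8859-1, which is one possible mismatch among many.

```python
# The two bytes a UTF-8 file holds for "ú" (U+00FA).
raw = "\u00fa".encode("utf-8")

# Decoding with the wrong encoding turns one character into two.
wrong = raw.decode("iso-8859-1")  # "Ãº"

# Decoding with the encoding the file really uses gives the original back.
right = raw.decode("utf-8")  # "ú"
```

Reading the raw bytes and leaving the decoding to the XML parser avoids this mismatch altogether.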

Published: 2023-08-07 03:58:00
Article link: https://www.elefans.com/category/jswz/34/1459575.html