Suppose I need to read an XML file from stdin. How do I get the encoding right?
Currently I simply do

    import sys
    import xml.dom.minidom

    xmlString = sys.stdin.read()
    doc = xml.dom.minidom.parseString(xmlString)

Apparently xmlString is not always decoded properly, which results in misinterpreted characters.
Is there a way to fix this, or do I have to live with what I get from stdin "as is"?
Edit: It's safe to assume that the file provided via stdin is an SVG file with a proper XML declaration, e.g. consider

    <?xml version="1.0" encoding="UTF-8"?>
    <svg xmlns="http://www.w3.org/2000/svg">
      <desc>ú</desc>
    </svg>

That means the encoding attribute can be used to detect the encoding (but obviously I have to read at least the first line for that), and afterwards I would have to somehow adjust reading from stdin to use the detected encoding.
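Best answer
If the file's encoding can differ each time and is completely unknown, you can use the chardet library to guess it. Note that it relies on statistics to find the best match, so it is not perfect.
If you know the encoding of the data, there are two options: set the PYTHONIOENCODING environment variable, or decode the bytes explicitly (str.decode in Python 2, bytes.decode in Python 3).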
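For the case where the encoding is known, explicit decoding could look roughly like the sketch below. The helper name `read_text` and the in-memory stream standing in for stdin are assumptions made for illustration only; they are not part of the original answer.

```python
import io


def read_text(stream, encoding="utf-8"):
    """Read all bytes from a binary stream and decode them with a known encoding.

    `read_text` is a hypothetical helper, not a standard library function.
    """
    raw = stream.read()
    return raw.decode(encoding)


# Simulate stdin with an in-memory byte stream holding Latin-1 encoded data.
fake_stdin = io.BytesIO("<desc>ú</desc>".encode("latin-1"))
text = read_text(fake_stdin, encoding="latin-1")
print(text)
```

With real input on Python 3, the stream would be `sys.stdin.buffer`. The alternative is to leave the code untouched and set the variable when launching the script, for example `PYTHONIOENCODING=latin-1 python script.py < file.svg` (the script and file names here are placeholders). If the encoding is unknown, `chardet.detect(raw)["encoding"]` can supply a guess to pass as `encoding`, with the caveat about accuracy mentioned above.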
The solution turned out to be very simple in this case. With

    import sys
    import xml.dom.minidom

    try:
        stream = sys.stdin.buffer   # Python 3: the underlying binary stream
    except AttributeError:
        stream = sys.stdin          # Python 2: stdin already yields bytes
    xmlString = stream.read()
    doc = xml.dom.minidom.parseString(xmlString)

stdin is read as a binary stream (i.e. not decoded). In my particular case the XML parser handles the decoding on its own just fine, which makes any effort on my side unnecessary.
Note that Python 3 opens stdin in text mode (decoded) by default, but in many cases with the wrong character encoding, because the encoding is taken from the environment (the locale or PYTHONIOENCODING) rather than from the XML declaration. The buffer attribute is therefore needed to access the underlying binary stream. The exception handling is needed because in Python 2 stdin already returns undecoded bytes and has no buffer attribute.
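To illustrate why handing raw bytes to the parser is enough, here is a small sketch that feeds ISO-8859-1 encoded bytes to minidom and lets it pick up the encoding from the XML declaration. The function name `parse_svg` and the in-memory stream replacing stdin are assumptions for the example.

```python
import io
import xml.dom.minidom


def parse_svg(stream):
    """Parse XML from a binary stream; the parser reads the declared encoding itself.

    `parse_svg` is a hypothetical helper used only for this illustration.
    """
    return xml.dom.minidom.parseString(stream.read())


svg = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<svg xmlns="http://www.w3.org/2000/svg"><desc>ú</desc></svg>'
)
# Encode the document the way the declaration claims, then pretend it is stdin.
fake_stdin = io.BytesIO(svg.encode("iso-8859-1"))

doc = parse_svg(fake_stdin)
desc = doc.getElementsByTagName("desc")[0].firstChild.data
print(desc)
```

In a real script, `parse_svg(sys.stdin.buffer)` would take the place of the simulated stream on Python 3.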