1. Problem:

When a file opened with Python's open() is read, a decoding error is raised: the 'utf-8' codec can't decode byte 0xed.

feed LM input feed file: ./data/raw/21000101.204243.txt
Traceback (most recent call last):
  File "run.py", line 9, in <module>
    traindata = load_data_in_cache()
  File "/data/deploy/wang/bertt/bigdata/feedrec/LM_embedding/gen_sample.py", line 20, in load_data_in_cache
    for line in input:
  File "/home/op_dev/wang/py3.6.12/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7014: invalid continuation byte

Another run failed the same way:

feed LM input feed file: ./data/raw/21000101.210302.txt
Traceback (most recent call last):
  File "run.py", line 9, in <module>
    traindata = load_data_in_cache()
  File "/data/deploy/wang/bertt/bigdata/feedrec/LM_embedding/gen_sample.py", line 20, in load_data_in_cache
    for line in input:
  File "/home/op_dev/wang/py3.6.12/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 2824: invalid continuation byte

The code that raises the error:

    for input_feeds_file in file_path:
        with open(input_feeds_file) as input:
            for line in input:
                line = line.strip()
                ......

2. Cause:

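This is a decoding problem. open() was called without an encoding argument, and the text was decoded as UTF-8. The error says that the 'utf-8' codec cannot decode the byte 0xed at position 2824. In UTF-8, 0xed is the first byte of a three-byte sequence and must be followed by continuation bytes; here the next byte is not a valid continuation byte, so the sequence is invalid.
In other words, the file contains bytes that are not valid UTF-8, most likely because it was written in a different encoding, so an encoding has to be passed to open() explicitly for the file to be read.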
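The failure can be reproduced on a few bytes in memory. The following is a minimal sketch; the helper name `first_invalid_utf8` and the sample strings are illustrative assumptions, not part of the original code.

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes, context: int = 8) -> Optional[Tuple[int, bytes, str]]:
    """Return (offset, nearby bytes, reason) for the first invalid UTF-8
    sequence in data, or None if the data decodes cleanly.
    Hypothetical helper, for illustration only."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        start = max(exc.start - context, 0)
        return exc.start, data[start:exc.start + context], exc.reason
    return None


# 0xED opens a three-byte sequence; here it is followed by valid continuation bytes.
valid = b"\xed\x95\x9c"  # U+D55C encoded as UTF-8

# In Latin-1 the single byte 0xED is the letter "í", followed here by ASCII "a".
latin1 = "d\u00eda".encode("latin-1")  # b'd\xeda'

print(first_invalid_utf8(valid))   # None
print(first_invalid_utf8(latin1))  # (1, b'd\xeda', 'invalid continuation byte')
```

To check a real file, the same helper can be given the raw bytes, for example the result of `open(path, "rb").read()`. The bytes around the reported offset usually hint at the encoding the file was really written in.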

3. Solution:

First attempt: adding encoding='unicode_escape' to the open() call gets rid of the error above.

    for input_feeds_file in file_path:
        with open(input_feeds_file, encoding='unicode_escape') as input:
            for line in input:
                line = line.strip()
                ......

But it then raises a different error:

feed LM input feed file: ./data/raw/21000101.210302.txt
Traceback (most recent call last):
  File "run.py", line 9, in <module>
    traindata = load_data_in_cache()
  File "/data/deploy/wang/bertt/bigdata/feedrec/LM_embedding/gen_sample.py", line 20, in load_data_in_cache
    for line in input:
  File "/home/op_dev/wang/py3.6.12/lib/python3.6/encodings/unicode_escape.py", line 26, in decode
    return codecs.unicode_escape_decode(input, self.errors)[0]
UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 8191: \ at end of string

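Cause: the 'unicodeescape' codec cannot decode the byte 0x5c (a backslash) at position 8191. This codec treats every backslash as the start of a Python escape sequence such as \n or \uXXXX, so a backslash with nothing after it is an error. Position 8191 is the last byte of an 8192-byte block, which suggests that the file is decoded block by block and a backslash happened to fall at the end of one block. unicode_escape is meant for Python string literals, not for reading data files.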
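A short sketch of how unicode_escape behaves on ordinary data may help here. The sample byte strings are made up for illustration.

```python
# unicode_escape interprets backslashes the way a Python string literal does.
sample = b"C:\\new"  # the six bytes: C : \ n e w
decoded = sample.decode("unicode_escape")
print(repr(decoded))  # 'C:\new' -> the two bytes "\n" became a newline character

# A backslash with nothing after it cannot be interpreted at all.
try:
    b"path\\".decode("unicode_escape")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason
print(reason)  # \ at end of string
```

So even when this codec does not raise, it silently changes any text that contains backslashes.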

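After some searching: to make the decoding error go away, use encoding='ISO-8859-1' directly; so far it has not raised any error. This works because ISO-8859-1 maps each of the 256 possible byte values to one character, so decoding can never fail. It does not guarantee that non-ASCII text comes out as the intended characters if the file was actually written in another encoding.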
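The trade-off can be sketched as follows. The sample text and the temporary file are assumptions made for illustration; `errors="replace"` is shown as one alternative that keeps UTF-8 decoding and marks the bad bytes instead of failing.

```python
import os
import tempfile

# ISO-8859-1 never raises, but multi-byte text is split into one character per byte.
raw = "\u4e2d\u6587".encode("utf-8")   # two Chinese characters, 6 bytes of UTF-8
mojibake = raw.decode("iso-8859-1")    # 6 unrelated characters, no error
print(len(mojibake))                   # 6

# The original bytes are still recoverable, because the mapping is one-to-one.
restored = mojibake.encode("iso-8859-1").decode("utf-8")

# Alternative: keep UTF-8 and replace undecodable bytes with U+FFFD.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"ok line\nd\xeda\n")    # second line contains a stray 0xED
    path = tmp.name

with open(path, encoding="utf-8", errors="replace") as f:
    lines = [line.strip() for line in f]
os.remove(path)

print(lines)  # ['ok line', 'd\ufffda']
```

If the real encoding of the file is known, passing that encoding to open() is the cleaner fix; the two options above only avoid the exception.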
References:

1. What exactly is the difference between Unicode, UTF-8 and ISO8859-1: https://blog.csdn/robertcpp/article/details/7837712
