1. Problem:
When a file is opened in Python with open() and its contents are read, a decoding error is raised: the 'utf-8' codec cannot decode the byte 0xed.
feed LM input feed file: ./data/raw/21000101.204243.txt
Traceback (most recent call last):
File "run.py", line 9, in <module>
traindata = load_data_in_cache()
File "/data/deploy/wang/bertt/bigdata/feedrec/LM_embedding/gen_sample.py", line 20, in load_data_in_cache
for line in input:
File "/home/op_dev/wang/py3.6.12/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7014: invalid continuation byte
A second run failed the same way on a different file:
feed LM input feed file: ./data/raw/21000101.210302.txt
Traceback (most recent call last):
File "run.py", line 9, in <module>
traindata = load_data_in_cache()
File "/data/deploy/wang/bertt/bigdata/feedrec/LM_embedding/gen_sample.py", line 20, in load_data_in_cache
for line in input:
File "/home/op_dev/wang/py3.6.12/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 2824: invalid continuation byte
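The code that raises the error:
for input_feeds_file in file_path:
    # No encoding argument, so the locale default is used (UTF-8 here, per the traceback)
    with open(input_feeds_file) as input:
        for line in input:
            line = line.strip()
            ......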
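Before picking another encoding, it can help to look at the bytes that actually fail. The sketch below is only an illustration: `find_bad_bytes` is a hypothetical helper, not part of the original script. It reads the file in binary mode and reports the file offset of the first undecodable byte, because the position printed in the traceback may be relative to the chunk being decoded rather than to the start of the file.

```python
def find_bad_bytes(path, encoding="utf-8", context=20):
    """Return (offset, nearby_bytes) for the first undecodable byte, or None.

    Hypothetical diagnostic helper; it loads the whole file into memory,
    so it is only suitable for files that fit in RAM.
    """
    with open(path, "rb") as f:  # binary mode: no decoding happens here
        data = f.read()
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        start = max(exc.start - context, 0)
        # exc.start is the offset of the offending byte within `data`
        return exc.start, data[start:exc.end + context]
    return None
```

Calling `find_bad_bytes("./data/raw/21000101.204243.txt")` would return the offset and the surrounding raw bytes, which often makes the real encoding recognizable by eye.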
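2. Cause:
This is a decoding problem. The error means that the 'utf-8' codec could not decode the byte 0xed (at position 7014 in the first run and 2824 in the second). The byte itself is not outside what UTF-8 can represent: 0xed is a legal lead byte of a three-byte sequence, but the byte that follows it is not a valid continuation byte, which is what "invalid continuation byte" reports. This usually means the file was written in some other encoding, or contains corrupted bytes.
In other words, the file contains bytes that cannot be decoded as UTF-8, so open() needs an encoding that lets the file be read.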
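The same error message can be reproduced in a few lines. This is a minimal sketch: the sample string and the choice of ISO-8859-1 as the "real" encoding are assumptions made for illustration, since the post does not say how the data files were produced.

```python
# In ISO-8859-1 the character 'í' is the single byte 0xED (assumed sample data).
raw = "aquí no".encode("iso-8859-1")  # b'aqu\xed no'

error = None
try:
    raw.decode("utf-8")  # 0xED starts a 3-byte sequence, but a space follows it
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding with the encoding the bytes were written in works fine.
decoded = raw.decode("iso-8859-1")
```

The exception carries the offending offset in `error.start` and the explanation in `error.reason`, which is the text after the colon in the traceback.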
3. Solution:
First attempt: adding encoding='unicode_escape' to the open() call got rid of the error above:
for input_feeds_file in file_path:
    with open(input_feeds_file, encoding='unicode_escape') as input:
        for line in input:
            line = line.strip()
            ......
but it then raised a different error:
feed LM input feed file: ./data/raw/21000101.210302.txt
Traceback (most recent call last):
File "run.py", line 9, in <module>
traindata = load_data_in_cache()
File "/data/deploy/wang/bertt/bigdata/feedrec/LM_embedding/gen_sample.py", line 20, in load_data_in_cache
for line in input:
File "/home/op_dev/wang/py3.6.12/lib/python3.6/encodings/unicode_escape.py", line 26, in decode
return codecs.unicode_escape_decode(input, self.errors)[0]
UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 8191: \ at end of string
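Cause: the 'unicodeescape' codec could not decode the byte 0x5c at position 8191. 0x5c is the backslash, which this codec treats as the start of an escape sequence; here the backslash was the last byte of the data handed to the decoder, so nothing followed it to complete the escape. unicode_escape is meant for Python string-literal escapes, not for reading arbitrary text files.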
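Both drawbacks of unicode_escape can be seen on small inputs. The byte strings below are made-up samples, not data from the original files.

```python
# A backslash with nothing after it cannot be completed into an escape sequence.
chunk = b"line ends with a backslash \\"

error = None
try:
    chunk.decode("unicode_escape")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Even when it does not fail, unicode_escape reads each non-escape byte as one
# ISO-8859-1 character, so UTF-8 text comes out garbled.
garbled = "é".encode("utf-8").decode("unicode_escape")  # two characters, not 'é'
```

So this codec only hides the original error until the data happens to contain a backslash in the wrong place.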
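After some searching: passing encoding='ISO-8859-1' makes the error go away, and it has not failed so far. It cannot fail, because ISO-8859-1 maps each of the 256 possible byte values to a character. The catch is that if the file is really in another encoding, its non-ASCII characters are read as the wrong characters instead of raising an error.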
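The trade-off can be checked directly. The sketch below uses made-up sample text, and `read_lines_lenient` is a hypothetical alternative rather than anything from the original script: it keeps UTF-8 as the encoding and replaces only the undecodable bytes, which may be preferable when most of the file is valid UTF-8.

```python
# ISO-8859-1 assigns a character to every byte value, so decoding cannot fail.
every_byte = bytes(range(256))
as_latin1 = every_byte.decode("iso-8859-1")
lossless = as_latin1.encode("iso-8859-1") == every_byte

# But if the bytes were really UTF-8, the result is the wrong characters.
utf8_bytes = "中文".encode("utf-8")         # 6 bytes
garbled = utf8_bytes.decode("iso-8859-1")   # 6 unrelated characters, no error
recovered = garbled.encode("iso-8859-1").decode("utf-8")

def read_lines_lenient(path):
    """Hypothetical helper: decode as UTF-8, replacing bad bytes with U+FFFD."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return [line.strip() for line in f]
```

Because the ISO-8859-1 decode is lossless at the byte level, text read that way can still be repaired later by re-encoding it, as `recovered` shows; with errors="replace" the bad bytes are gone for good, but valid UTF-8 text stays readable.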
Reference:
1. What exactly is the difference between Unicode, UTF-8 and ISO8859-1: https://blog.csdn/robertcpp/article/details/7837712