How to convert a Big5-encoded txt file to a UTF-8-encoded txt file?

Updated: 2024-10-27 16:38:55

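I have a Big5-encoded file that Mac TextEdit cannot open. I would like to convert the whole file to UTF-8, since UTF-8 is far more widely supported.

I tried iconv in the terminal, but it fails, and searching Google turned up nothing useful about this error.

$ iconv -f BIG5 -t UTF8 in.txt > out.txt
iconv: in.txt:5:0: cannot convert

Is there another way to convert it?

I got the txt file from here; it is a list of Chinese names written in Traditional Chinese as used in Taiwan.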

Best answer

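Looking at the first 20 lines of your file, it is clear that the encoding uses the byte 0x8C as the first byte of some multibyte sequences. The encodings that have this property are:

BIG5 BIG5-HKSCS CP932 CP936 CP949 CP950 GB18030 GBK JOHAB Shift_JIS Shift_JISX0213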
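If you want to check the 0x8C observation yourself, something like the following might do. It is only a sketch: the function name count_lines_with_0x8c is made up for illustration, and it assumes a grep that matches raw bytes when run in the C locale.

```shell
# Hypothetical helper (not part of iconv or any standard tool):
# count how many of the first N lines of a file contain the byte 0x8C.
count_lines_with_0x8c() {
  file=$1
  n=${2:-20}   # default: look at the first 20 lines
  # \214 is 0x8C in octal; LC_ALL=C makes grep compare single bytes.
  head -n "$n" < "$file" | LC_ALL=C grep -c "$(printf '\214')"
}
```

Calling count_lines_with_0x8c unique_names_2012.txt prints the number of matching lines among the first 20. It only shows that the byte occurs, not that it is a lead byte; for that you still have to look at the surrounding bytes.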

Try them in turn:

$ for encoding in BIG5 BIG5-HKSCS CP932 CP936 CP949 CP950 GB18030 GBK \
      JOHAB Shift_JIS Shift_JISX0213; do \
    if head -n 20 < unique_names_2012.txt | iconv -f $encoding -t UTF-8 > /dev/null 2> /dev/null; then \
      echo $encoding ; \
    fi; \
  done

With GNU libiconv, it prints

BIG5-HKSCS
CP950
GB18030

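Is it in GB18030 encoding?

$ iconv -f GB18030 < unique_names_2012.txt

shows hundreds of lines that use characters in the Private Use Area (PUA) range. While not impossible, it seems unlikely.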
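To put a number on "hundreds of lines", you could count the lines of the converted output that contain a PUA character. The sketch below is an assumption-laden illustration: count_pua_lines is a made-up name, it only covers the BMP Private Use Area (U+E000 to U+F8FF), and it relies on grep matching raw bytes in the C locale. In UTF-8 that range is the lead byte 0xEE followed by anything, or 0xEF followed by a byte from 0x80 to 0xA3.

```shell
# Hypothetical helper: read UTF-8 text on stdin and print the number of
# lines that contain a character in U+E000..U+F8FF (BMP Private Use Area).
count_pua_lines() {
  # \356 = 0xEE: every valid UTF-8 sequence starting with it is U+E000..U+EFFF.
  # \357 = 0xEF followed by 0x80..0xA3 covers U+F000..U+F8FF.
  LC_ALL=C grep -c \
    -e "$(printf '\356')" \
    -e "$(printf '\357[\200-\243]')"
}
```

For example, iconv -f GB18030 -t UTF-8 < unique_names_2012.txt | count_pua_lines would print how many lines of the GB18030 interpretation contain such characters.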

Is it in CP950 encoding?

$ iconv -f CP950 < unique_names_2012.txt

gives a conversion error at line 2294.

Is it in BIG5-HKSCS encoding?

$ iconv -f BIG5-HKSCS < unique_names_2012.txt

gives a conversion error at line 713.
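iconv stops at the first error, so the commands above only tell you where the first problem is. To see every line that a given encoding rejects, one option is to feed the file to iconv line by line. This is a rough sketch, not a polished tool: failing_lines is a made-up name, it starts one iconv process per line (slow on big files), and it assumes the file contains no NUL bytes.

```shell
# Hypothetical helper: print the numbers of all lines in FILE that iconv
# cannot convert from ENCODING to UTF-8.
failing_lines() {
  encoding=$1
  file=$2
  lineno=0
  # IFS= and -r keep each line byte-for-byte (Big5 trail bytes can be 0x5C).
  while IFS= read -r line || [ -n "$line" ]; do
    lineno=$((lineno + 1))
    if ! printf '%s\n' "$line" | iconv -f "$encoding" -t UTF-8 > /dev/null 2>&1; then
      echo "$lineno"
    fi
  done < "$file"
}
```

For example, failing_lines CP950 unique_names_2012.txt lists the line numbers CP950 cannot handle, and sed -n '2294p' unique_names_2012.txt | od -An -tx1 dumps the raw bytes of one of them so you can compare against the conversion tables.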

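So, most probably the file is encoded in a variant of BIG5. There are many such variants; see http://haible.de/bruno/charsets/conversion-tables/Chinese.html. Possibly it is an extension of CP950 or of BIG5-HKSCS, since these are the most popular encodings in the BIG5 family today.

In summary, such conversion errors are caused by the unstandardized proliferation of BIG5 variants.

The best thing you can do is to request the original file in UTF-8 encoding and let the originator deal with it.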

Published: 2023-08-05 15:11:00

Article link: https://www.elefans.com/category/jswz/34/1434732.html