How can I detect which Chinese encoding a text file uses?

According to http://www.gnu.org/software/libiconv/ there are something like 20 encodings for Chinese:

Chinese EUC-CN, HZ, GBK, CP936, GB18030, EUC-TW, BIG5, CP950, BIG5-HKSCS, BIG5-HKSCS:2004, BIG5-HKSCS:2001, BIG5-HKSCS:1999, ISO-2022-CN, ISO-2022-CN-EXT

I have a text file that is not UTF-8. It's ASCII. I want to convert it to UTF-8 using iconv(), but for that I need to know the character encoding of the source.

How can I do that if I don't know Chinese? :(

I noticed that:

    $str = iconv('GB18030', 'UTF-8', $str);
    file_put_contents('file.txt', $str);

produces a UTF-8 encoded file, while the other encodings I tried (CP950, GBK and EUC-CN) produced an ASCII file. Could that mean that iconv is able to detect when the input encoding is wrong for the given string?

Best answer

This may work for your needs (but I really can't tell). Setting the locale, calling utf8_decode, and using mb_check_encoding instead of mb_detect_encoding seems to give some useful output:

    // some text from http://chinesenotes.com/chinese_text_l10n.php
    // have tried both as string and content loaded from a file
    $chinese = '譧躆礛簼繰剆坲姏潧騔鯬跠瘱瘵瘲忁曨曣蛃袚觙';
    $chinese = utf8_decode($chinese);

    $chinese_encodings = 'EUC-CN,HZ,GBK,CP936,GB18030,EUC-TW,BIG5,CP950,BIG5-HKSCS,BIG5-HKSCS:2004,BIG5-HKSCS:2001,BIG5-HKSCS:1999,ISO-2022-CN,ISO-2022-CN-EXT';
    $encodings = explode(',', $chinese_encodings);

    // set chinese locale
    setlocale(LC_CTYPE, 'Chinese');

    foreach ($encodings as $encoding) {
        if (@mb_check_encoding($chinese, $encoding)) {
            echo 'The string seems to be compatible with '.$encoding.'<br>';
        } else {
            echo 'Not compatible with '.$encoding.'<br>';
        }
    }

outputs

    The string seems to be compatible with EUC-CN
    The string seems to be compatible with HZ
    The string seems to be compatible with GBK
    The string seems to be compatible with CP936
    Not compatible with GB18030
    The string seems to be compatible with EUC-TW
    The string seems to be compatible with BIG5
    The string seems to be compatible with CP950
    Not compatible with BIG5-HKSCS
    Not compatible with BIG5-HKSCS:2004
    Not compatible with BIG5-HKSCS:2001
    Not compatible with BIG5-HKSCS:1999
    Not compatible with ISO-2022-CN
    Not compatible with ISO-2022-CN-EXT
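One caveat on that output, as a minimal sketch rather than a confirmed diagnosis: utf8_decode converts to ISO-8859-1 and replaces every character that does not fit with "?", so the string being checked may contain nothing but question marks, which are valid in most encodings. The snippet below uses mb_convert_encoding as a stand-in for utf8_decode (an assumption, chosen because utf8_decode is deprecated as of PHP 8.2), and the shortened sample string is for illustration only.

```php
<?php
// Stand-in for utf8_decode(): convert UTF-8 to ISO-8859-1.
// Characters with no ISO-8859-1 equivalent are substituted
// (with "?" under the default mbstring settings).
$sample  = '譧躆礛';
$decoded = mb_convert_encoding($sample, 'ISO-8859-1', 'UTF-8');

var_dump($decoded);

// Whatever is left is plain ASCII, so it says little about
// which Chinese encoding the original bytes were in.
var_dump(mb_check_encoding($decoded, 'ASCII'));
```

If this is what happens, the "compatible" lines would mostly reflect which encoding names mbstring recognises, not the content of the text.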

This is a total guess. It at least seems to recognise some of the Chinese encodings now. Delete it if it is total junk.
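A variant of the same idea that checks the raw bytes directly, without utf8_decode, might look like the sketch below. The helper name plausibleEncodings is hypothetical and not from the original post, and the code assumes the mbstring extension is loaded. A byte-level check like this can only rule encodings out: many of these encodings overlap, so several candidates will usually pass for the same input.

```php
<?php
// Hypothetical helper: returns the candidate encodings under which
// the raw bytes form a valid string according to mbstring.
function plausibleEncodings(string $bytes, array $candidates): array
{
    $plausible = [];
    foreach ($candidates as $encoding) {
        try {
            // "@" hides the warning older PHP versions emit for
            // encoding names that mbstring does not know.
            if (@mb_check_encoding($bytes, $encoding)) {
                $plausible[] = $encoding;
            }
        } catch (\ValueError $e) {
            // PHP 8 throws for unknown encoding names; skip them.
            continue;
        }
    }
    return $plausible;
}

// Usage: the bytes D6 D0 CE C4 are "中文" in GBK/CP936.
$bytes = "\xD6\xD0\xCE\xC4";
print_r(plausibleEncodings($bytes, ['EUC-CN', 'CP936', 'BIG5', 'UTF-8']));
```

To choose among the encodings that remain, converting the file with each one and showing the results to someone who reads Chinese is probably still the most reliable check.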
