How to detect which type of Chinese encoding a text file has?

Updated: 2024-10-27 18:33:16

On http://www.gnu.org/software/libiconv/ there are something like 20 types of encoding listed for Chinese:

Chinese: EUC-CN, HZ, GBK, CP936, GB18030, EUC-TW, BIG5, CP950, BIG5-HKSCS, BIG5-HKSCS:2004, BIG5-HKSCS:2001, BIG5-HKSCS:1999, ISO-2022-CN, ISO-2022-CN-EXT

So I have a text file that is not UTF-8. It's ASCII. And I want to convert it to UTF-8 using iconv(). But for that I need to know the character encoding of the source.

How can I do that if I don't know Chinese? :(

I noticed that:

    $str = iconv('GB18030', 'UTF-8', $str);
    file_put_contents('file.txt', $str);

produces a UTF-8 encoded file, while the other encodings I tried (CP950, GBK and EUC-CN) produced an ASCII file. Could that mean that iconv is able to detect if the input encoding is wrong for the given string?
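One way to probe this empirically: iconv() returns false (and raises a diagnostic) when the input contains a byte sequence that is illegal in the stated source encoding, so its return value can be checked for each candidate. The following is only a minimal sketch; the helper name encodingsAcceptedByIconv and the sample bytes are made up for illustration, and which encoding names are available depends on the iconv implementation PHP was built against.

```php
<?php
// Hypothetical helper: returns the candidate encodings under which iconv()
// converts $bytes to UTF-8 without reporting an illegal input sequence.
function encodingsAcceptedByIconv(string $bytes, array $candidates): array
{
    $accepted = [];
    foreach ($candidates as $encoding) {
        // @ silences the diagnostic iconv() raises on invalid input;
        // in that case the call returns false.
        $converted = @iconv($encoding, 'UTF-8', $bytes);
        if ($converted !== false) {
            $accepted[] = $encoding;
        }
    }
    return $accepted;
}

// Sample input (an assumption for this sketch): "中文" encoded as GBK,
// i.e. the byte sequence D6 D0 CE C4.
$sample = "\xD6\xD0\xCE\xC4";
print_r(encodingsAcceptedByIconv($sample, ['GB18030', 'GBK', 'EUC-CN', 'BIG5', 'CP950']));
```

A successful conversion only shows that the bytes are legal in that encoding, not that the encoding is the right one: the byte ranges of these encodings overlap, so several candidates may be accepted for the same input. GB18030 is a superset of GBK, which in turn extends EUC-CN, so text using characters outside GBK could plausibly fail for GBK and EUC-CN while converting cleanly from GB18030.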

Best answer

This may work for your needs (but I really can't tell). Setting the locale, calling utf8_decode, and using mb_check_encoding instead of mb_detect_encoding seems to give some useful output.

    // some text from http://chinesenotes.com/chinese_text_l10n.php
    // have tried both as string and content loaded from a file
    $chinese = '譧躆 礛簼繰 剆坲姏 潧 騔鯬 跠 瘱瘵瘲 忁曨曣 蛃袚觙';
    $chinese = utf8_decode($chinese);

    $chinese_encodings = 'EUC-CN,HZ,GBK,CP936,GB18030,EUC-TW,BIG5,CP950,BIG5-HKSCS,BIG5-HKSCS:2004,BIG5-HKSCS:2001,BIG5-HKSCS:1999,ISO-2022-CN,ISO-2022-CN-EXT';
    $encodings = explode(',', $chinese_encodings);

    // set chinese locale
    setlocale(LC_CTYPE, 'Chinese');

    foreach ($encodings as $encoding) {
        if (@mb_check_encoding($chinese, $encoding)) {
            echo 'The string seems to be compatible with ' . $encoding . '<br>';
        } else {
            echo 'Not compatible with ' . $encoding . '<br>';
        }
    }

outputs

    The string seems to be compatible with EUC-CN
    The string seems to be compatible with HZ
    The string seems to be compatible with GBK
    The string seems to be compatible with CP936
    Not compatible with GB18030
    The string seems to be compatible with EUC-TW
    The string seems to be compatible with BIG5
    The string seems to be compatible with CP950
    Not compatible with BIG5-HKSCS
    Not compatible with BIG5-HKSCS:2004
    Not compatible with BIG5-HKSCS:2001
    Not compatible with BIG5-HKSCS:1999
    Not compatible with ISO-2022-CN
    Not compatible with ISO-2022-CN-EXT

This is a total guess, but it at least seems to recognise some of the Chinese encodings. Delete if it is total junk.
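A caveat on the check above: utf8_decode() converts UTF-8 to ISO-8859-1 and replaces every character that does not fit in Latin-1 with "?", so the string that reaches mb_check_encoding() probably consists only of question marks and spaces, which are valid in almost any ASCII-compatible encoding. The sketch below runs the same kind of check on raw, undecoded bytes instead. It is an illustration under assumptions, not a reliable detector: the helper name encodingsValidForBytes and the sample bytes are made up, and mb_convert_encoding() stands in for utf8_decode(), which is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: returns the candidates for which $bytes is a
// well-formed byte sequence according to mbstring.
function encodingsValidForBytes(string $bytes, array $candidates): array
{
    $valid = [];
    foreach ($candidates as $encoding) {
        try {
            // PHP 7 warns and returns false for unknown names (hence the @).
            $ok = @mb_check_encoding($bytes, $encoding);
        } catch (ValueError $e) {
            // PHP 8 throws for encoding names mbstring does not support.
            $ok = false;
        }
        if ($ok) {
            $valid[] = $encoding;
        }
    }
    return $valid;
}

$candidates = ['EUC-CN', 'CP936', 'BIG5', 'CP950', 'UTF-8'];

// Assumed sample: raw GBK bytes for "中文", as file_get_contents() would
// return them from a GBK-encoded file.
$rawBytes = "\xD6\xD0\xCE\xC4";
print_r(encodingsValidForBytes($rawBytes, $candidates));

// The same text as UTF-8, pushed through a lossy Latin-1 conversion:
// every Chinese character collapses to "?".
$lossy = mb_convert_encoding("\xE4\xB8\xAD\xE6\x96\x87", 'ISO-8859-1', 'UTF-8');
print_r(encodingsValidForBytes($lossy, $candidates));
```

Even on raw bytes, a passing check only narrows the candidates down. Telling apart the encodings that remain generally needs statistical detection or a reader who knows the language.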

Published: 2023-07-28 23:39:00