How to detect CP437 with PHP

Updated: 2024-10-24 01:56:35
Problem description

I am trying to detect the encoding of a given string so that I can later convert it to UTF-8 using iconv. I want to restrict the set of source encodings to UTF-8, ISO-8859-1, Windows-1251 and CP437.

//...
$acceptedEncodings = array('utf-8', 'iso-8859-1', 'windows-1251');
$srcEncoding = mb_detect_encoding($content, $acceptedEncodings, true);
if ($srcEncoding) {
    $content = iconv($srcEncoding, 'UTF-8', $content);
}
//...

The problem is that mb_detect_encoding does not seem to accept CP437 as a supported encoding. When I give it a CP437-encoded string, the string is classified as iso-8859-1, which causes iconv to ignore characters like ü.

My question is: is there a way to detect CP437 encoding up front? The conversion from CP437 to UTF-8 using iconv works fine, but I just cannot find a proper way to detect CP437.

Thank you very much.

Recommended answer

As has been discussed countless times before: it is fundamentally impossible to distinguish any single-byte encoding from any other single-byte encoding. What you get is a bunch of bytes. In encoding A the byte 0x42 may map to character X, and in encoding B the same byte may map to character Y. But nothing about the blob of bytes you have tells you which, because you only have the bytes. They can mean anything, and they are equally valid in all encodings. It is possible to identify more complex multi-byte encodings like UTF-8, since they have to follow stricter internal rules. So you can say with certainty "This is not valid UTF-8." However, it is impossible to say with 100% certainty "This is definitely UTF-8, not ISO-8859."
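To make the ambiguity concrete, here is a minimal sketch that decodes the same single byte under three of the encodings from the question, then shows the one check that can be made reliably: whether the bytes are valid UTF-8. The helper name decodeToUtf8Hex is hypothetical, and the sketch assumes the local iconv build knows the names CP437, ISO-8859-1 and WINDOWS-1251.

```php
<?php
// Hypothetical helper (not part of PHP): decode raw bytes as $encoding
// and return the resulting UTF-8 bytes as a hex string.
function decodeToUtf8Hex(string $bytes, string $encoding): string
{
    return bin2hex(iconv($encoding, 'UTF-8', $bytes));
}

$byte = "\x81"; // one byte, three equally "valid" readings

foreach (['CP437', 'ISO-8859-1', 'WINDOWS-1251'] as $encoding) {
    printf("%-12s -> %s\n", $encoding, decodeToUtf8Hex($byte, $encoding));
}
// CP437        -> c3bc  (ü)
// ISO-8859-1   -> c281  (control character U+0081)
// WINDOWS-1251 -> d083  (Ѓ)

// UTF-8 has structural rules, so invalid input can be rejected with certainty.
var_dump(mb_check_encoding($byte, 'UTF-8'));           // bool(false): lone continuation byte
var_dump(mb_check_encoding("M\xC3\xBCller", 'UTF-8')); // bool(true): well-formed sequence
```

Note that a true result only means the bytes are well-formed UTF-8, not that UTF-8 was the encoding the author intended.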

You need metadata about the content you receive that tells you which encoding it is in. Identifying the encoding after the fact is not practical: you would have to employ actual content analysis to figure out in which encoding a piece of text makes the most sense.
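If no metadata is available, content analysis could look roughly like the sketch below. It is a heuristic under stated assumptions, not a reliable detector: the function guessEncoding and its $expectedChars parameter are hypothetical, the caller is assumed to know which non-ASCII letters the text is likely to contain (German umlauts here), and the source file is assumed to be saved as UTF-8.

```php
<?php
/**
 * Hypothetical helper: guess which candidate encoding makes the most sense
 * for $bytes by counting how many decoded characters fall into an alphabet
 * the caller expects. Returns null when no candidate scores above zero.
 */
function guessEncoding(string $bytes, array $candidates, string $expectedChars): ?string
{
    // Well-formed UTF-8 (including plain ASCII) is accepted as UTF-8 first.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }

    $pattern = '/[' . preg_quote($expectedChars, '/') . ']/u';
    $best = null;
    $bestScore = 0;

    foreach ($candidates as $encoding) {
        // iconv returns false (and raises a notice) for bytes the encoding does not define.
        $utf8 = @iconv($encoding, 'UTF-8', $bytes);
        if ($utf8 === false) {
            continue;
        }
        // Score = number of characters that belong to the expected alphabet.
        $score = (int) preg_match_all($pattern, $utf8);
        if ($score > $bestScore) {
            $best = $encoding;
            $bestScore = $score;
        }
    }

    return $best;
}

$candidates = ['ISO-8859-1', 'WINDOWS-1251', 'CP437'];
$expected   = 'äöüÄÖÜß'; // assumption: the text is German

var_dump(guessEncoding("M\x81ller", $candidates, $expected)); // string(5) "CP437"
var_dump(guessEncoding("M\xFCller", $candidates, $expected)); // string(10) "ISO-8859-1"
```

The result could then be passed to iconv as the source encoding, as in the question's code. The guess is only as good as the expected alphabet: short strings, ties between candidates, or text in an unexpected language will produce wrong or null answers, which is why real metadata remains preferable.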


Published: 2023-11-10 06:26:11
Link: https://www.elefans.com/category/jswz/34/1574602.html