I'm using Perl to process hundreds of thousands of plain-text files in the UTF-8 character encoding scheme of Unicode. These plain-text documents are computer evidence in the legal discovery process. I don't have the luxury of either replacing them or ignoring them.
My problem is that some of these files are polluted with garbage: encoding-corrupted text, invalid binary data, etc. I need to be able to detect and report exactly what's wrong with these supposed plain-text documents in Unicode terms. In other words, I must identify the presence of specific categories of invalid Unicode code points: Unicode non-characters, surrogates, and non-Unicode characters. It's not enough just to work around them, which I know how to do.
Using Perl 5.14, how can I detect and report Unicode code points that aren't legal for interchange? I'm mostly just looking for hints on how to get started.
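One possible starting point (a minimal sketch, not the only approach): since a strict `:encoding(UTF-8)` layer rejects these code points at decode time, first read the bytes raw and decode them laxly so the offending code points survive long enough to be classified, then test each code point against the three categories by its numeric value. The helper name `report_illegal_codepoints` and the sample input below are hypothetical:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use v5.14;
# Allow us to *hold and inspect* such code points without warnings firing:
no warnings qw(surrogate nonchar non_unicode);

# Hypothetical helper: classify every code point in an already-decoded
# string that is not legal for open interchange; one report per offense.
sub report_illegal_codepoints {
    my ($string, $name) = @_;
    my @reports;
    my $offset = 0;
    for my $char (split //, $string) {
        my $cp = ord $char;
        if ($cp >= 0xD800 && $cp <= 0xDFFF) {
            push @reports,
                sprintf "%s: offset %d: UTF-16 surrogate U+%04X", $name, $offset, $cp;
        }
        elsif ($cp > 0x10FFFF) {
            push @reports,
                sprintf "%s: offset %d: non-Unicode code point 0x%X", $name, $offset, $cp;
        }
        elsif (($cp & 0xFFFE) == 0xFFFE || ($cp >= 0xFDD0 && $cp <= 0xFDEF)) {
            push @reports,
                sprintf "%s: offset %d: noncharacter U+%04X", $name, $offset, $cp;
        }
        $offset++;
    }
    return @reports;
}

# Example: a string deliberately polluted with one of each category.
my $dirty = "ok" . chr(0xFFFF) . chr(0xD800) . chr(0x110000);
say for report_illegal_codepoints($dirty, "sample");
```

Inside a regex you could equally match the first and third categories with the properties `\p{Cs}` (surrogates) and `\p{Noncharacter_Code_Point}`; the explicit numeric ranges above are just the spelled-out version. For the lax decoding step itself, `Encode::decode` with a non-croaking CHECK argument (rather than `Encode::FB_CROAK`) is one way to keep malformed input inspectable instead of fatal.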