I'm using Perl to process hundreds of thousands of plain-text files in the UTF-8 character encoding scheme of Unicode. These plain-text documents are computer evidence in the legal discovery process. I don't have the luxury of either replacing them or ignoring them.
My problem is that some of these files are polluted with garbage: encoding-corrupted text, invalid binary data, etc. I need to be able to detect and report exactly what's wrong with these supposed plain-text documents in Unicode terms. In other words, I must identify the presence of specific categories of invalid Unicode code points: Unicode non-characters, surrogates, and non-Unicode characters. It's not enough just to work around them, which I know how to do.
Using Perl 5.14, how can I detect and report Unicode code points that aren't legal for interchange? I'm mostly just looking for hints on how to get started.
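One possible starting point (a minimal sketch, not the only approach): since a strict `:encoding(UTF-8)` layer rejects these code points at decode time, first read the bytes raw and decode them laxly so the offending code points survive long enough to be classified, then test each code point against the three categories by its numeric value. The helper name `report_illegal_codepoints` and the sample input below are hypothetical:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use v5.14;
# Allow us to *hold and inspect* such code points without warnings firing:
no warnings qw(surrogate nonchar non_unicode);

# Hypothetical helper: classify every code point in an already-decoded
# string that is not legal for open interchange; one report per offense.
sub report_illegal_codepoints {
    my ($string, $name) = @_;
    my @reports;
    my $offset = 0;
    for my $char (split //, $string) {
        my $cp = ord $char;
        if ($cp >= 0xD800 && $cp <= 0xDFFF) {
            push @reports,
                sprintf "%s: offset %d: UTF-16 surrogate U+%04X", $name, $offset, $cp;
        }
        elsif ($cp > 0x10FFFF) {
            push @reports,
                sprintf "%s: offset %d: non-Unicode code point 0x%X", $name, $offset, $cp;
        }
        elsif (($cp & 0xFFFE) == 0xFFFE || ($cp >= 0xFDD0 && $cp <= 0xFDEF)) {
            push @reports,
                sprintf "%s: offset %d: noncharacter U+%04X", $name, $offset, $cp;
        }
        $offset++;
    }
    return @reports;
}

# Example: a string deliberately polluted with one of each category.
my $dirty = "ok" . chr(0xFFFF) . chr(0xD800) . chr(0x110000);
say for report_illegal_codepoints($dirty, "sample");
```

Inside a regex you could equally match the first and third categories with the properties `\p{Cs}` (surrogates) and `\p{Noncharacter_Code_Point}`; the explicit numeric ranges above are just the spelled-out version. For the lax decoding step itself, `Encode::decode` with a non-croaking CHECK argument (rather than `Encode::FB_CROAK`) is one way to keep malformed input inspectable instead of fatal.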