Updated: 2024-10-27 10:26:46
How can I detect and report Unicode code points that aren't legal for interchange using Perl?

I'm using Perl to process hundreds of thousands of plain-text files in the UTF-8 character encoding scheme of Unicode. These plain-text documents are computer evidence in the legal discovery process. I don't have the luxury of either replacing them or ignoring them.

My problem is that some of these files are polluted with garbage: encoding-corrupted text, invalid binary data, etc. I need to be able to detect and report exactly what's wrong with these supposed plain-text documents in Unicode terms. In other words, I must identify the presence of specific categories of invalid Unicode code points: Unicode noncharacters, surrogates, and non-Unicode code points (values above U+10FFFF). It's not enough just to work around them, which I know how to do.

Using Perl 5.14, how can I detect and report Unicode code points that aren't legal for interchange? I'm mostly just looking for hints on how to get started.
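As a starting point, here is a minimal sketch of one way to classify the three categories named above by numeric code point range. The function names (`classify_code_point`, `find_illegal_code_points`, `scan_utf8_bytes`) and the record layout are hypothetical, chosen only for illustration. The sketch assumes that the core `utf8::decode` function decodes Perl's extended (lax) UTF-8, so that encoded surrogates and code points above U+10FFFF survive decoding and can be inspected; bytes that are not even lax UTF-8 are reported as malformed rather than classified.

```perl
use strict;
use warnings;

# Classify one code point (an integer). Returns a category name, or
# nothing (undef in scalar context) if the code point is legal for interchange.
sub classify_code_point {
    my ($cp) = @_;
    return 'non-Unicode'  if $cp > 0x10FFFF;                  # beyond the Unicode code space
    return 'surrogate'    if $cp >= 0xD800 && $cp <= 0xDFFF;  # U+D800..U+DFFF
    return 'noncharacter' if ($cp >= 0xFDD0 && $cp <= 0xFDEF) # U+FDD0..U+FDEF
                          || ($cp & 0xFFFE) == 0xFFFE;        # last two code points of each plane
    return;
}

# Scan an already-decoded character string. Returns a list of hash
# references, one per offending code point, in order of appearance.
sub find_illegal_code_points {
    my ($text) = @_;
    my @found;
    for my $offset (0 .. length($text) - 1) {
        my $cp = ord substr($text, $offset, 1);
        my $category = classify_code_point($cp) or next;
        push @found, {
            offset     => $offset,                  # character offset, zero-based
            code_point => sprintf('U+%04X', $cp),
            category   => $category,
        };
    }
    return @found;
}

# Decode raw bytes leniently, then scan them.
# Assumption: utf8::decode accepts Perl-extended UTF-8 and returns false
# only when the bytes are malformed.
sub scan_utf8_bytes {
    my ($bytes) = @_;
    my $text = $bytes;    # work on a copy; utf8::decode modifies in place
    utf8::decode($text) or die "malformed UTF-8\n";
    return find_illegal_code_points($text);
}
```

A caller could wrap `scan_utf8_bytes` in `eval` to tell malformed input apart from well-formed input that merely contains illegal code points, and then print each record's `offset`, `code_point`, and `category` to the report. The numeric comparisons are used here instead of Unicode property matches so that the check does not depend on how a given Perl version treats properties for code points above U+10FFFF.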

Published: 2023-08-03 02:26:00
Link: https://www.elefans.com/category/jswz/34/1382910.html