Normalizing Unicode does not work as expected

Updated: 2024-10-24 04:30:55

I am currently having problems with different Unicode representations of special characters, especially those with accents, diaereses, and so on. I wrote a Python script that parses multiple database dumps and compares values between them. The problem is that these special characters are stored differently in different files: in some files they are composed, in others decomposed. Because I want the strings extracted from the dumps to always be in the composed representation, I tried adding the following line:

value = unicodedata.normalize("NFC", value)

However, this solves my problem only in some cases. For example, it works as expected for umlauts. Characters like ë, however, remain in the decomposed form (e͏̈).

I found out that there is a COMBINING GRAPHEME JOINER character (U+034F) between the e and the diaeresis character. Is that normal, or could this be the cause of my problem?

Does anybody know how to handle this issue?

Best answer

The purpose of U+034F COMBINING GRAPHEME JOINER is to ensure that certain sequences remain distinct under searching/sorting/normalisation. This is required for the correct handling of characters and combining marks as they are used in some languages with Unicode algorithms. From section 23.2 of the Unicode Standard (page 805):

U+034F combining grapheme joiner (CGJ) is used to affect the collation of adjacent characters for purposes of language-sensitive collation and searching. It is also used to distinguish sequences that would otherwise be canonically equivalent.

...

In turn, this means that insertion of a combining grapheme joiner between two combining marks will prevent normalization from switching the positions of those two combining marks, regardless of their own combining classes.
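The effect on the string from the question can be checked with Python's standard `unicodedata` module. The following is a minimal sketch: the helper name `describe` and the sample strings are illustrative assumptions, not taken from the original post.

```python
import unicodedata


def describe(text):
    """Return (code point, name, canonical combining class) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.combining(ch))
        for ch in text
    ]


plain = "e\u0308"         # e + COMBINING DIAERESIS
joined = "e\u034f\u0308"  # e + CGJ + COMBINING DIAERESIS, as in the question

# Without CGJ, NFC composes the pair into the single code point U+00EB.
composed_plain = unicodedata.normalize("NFC", plain)

# CGJ has combining class 0, so it separates the base letter from the
# diaeresis and NFC leaves the sequence decomposed.
composed_joined = unicodedata.normalize("NFC", joined)

# Canonical reordering: the dot below (class 220) is moved in front of the
# diaeresis (class 230) unless a CGJ sits between the two marks.
reordered = unicodedata.normalize("NFD", "a\u0308\u0323")
kept = unicodedata.normalize("NFD", "a\u0308\u034f\u0323")

for row in describe(composed_joined):
    print(row)
```

Printing `describe(value)` for a value from a dump is a quick way to see whether a CGJ or some other unexpected code point is what keeps a character from composing.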

In general, you should not remove a CGJ without some special knowledge about why it was inserted in the first place.
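If you have established that the CGJ in your dumps carries no meaning (for example, because an export tool introduced it), one possible approach is to strip it before normalizing. This is only a sketch under that assumption: `normalize_without_cgj` is a hypothetical helper name, and removing the character discards whatever distinction its author intended.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER


def normalize_without_cgj(value, form="NFC"):
    """Drop every CGJ from value, then apply the requested normalization form.

    Assumption: the CGJ is known to be meaningless in this data set.
    """
    return unicodedata.normalize(form, value.replace(CGJ, ""))
```

With this helper, `normalize_without_cgj("e\u034f\u0308")` yields the composed `"\u00eb"`, whereas plain `unicodedata.normalize("NFC", ...)` leaves the sequence untouched.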

Published: 2023-07-06 07:28:00

Article link: https://www.elefans.com/category/jswz/34/1047333.html