Among the vast number of Unicode characters, there are some that actually represent more than one character, like the U+FB00 ligature ﬀ, which stands for two 'f' characters. Is there an easy way to convert characters like these into multiple single characters? Preferably something available in the standard Java API, but I can use an external library if need be.
Recommended answer

U+FB00 is a compatibility character. Unicode normally does not assign separate code points to ligatures, on the grounds that whether and when to use a ligature is a layout decision that should not influence how the data is stored. A few ligature code points still exist to allow round-trip conversion with older encodings that do represent ligatures as separate entities.
Luckily, the information about which characters a ligature represents is present in the Unicode data file, and most capable string handling systems have that data built in.
In Java, you'll need to use the java.text.Normalizer class with the NFKC normalization form:
    import java.text.Normalizer;
    import java.text.Normalizer.Form;

    String ff = "\uFB00";  // LATIN SMALL LIGATURE FF
    String normalized = Normalizer.normalize(ff, Form.NFKC);
    System.out.println(ff + " = " + normalized);

This will print

    ﬀ = ff
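
For longer strings, the same call can be wrapped in a small helper. The following is only a sketch: the class name LigatureDemo and the method expand are hypothetical names chosen for illustration, not part of the JDK. It also shows that NFKC is not limited to ligatures and rewrites other compatibility characters as well, which may or may not be what you want.

```java
import java.text.Normalizer;

public class LigatureDemo {

    // Hypothetical helper (not part of the JDK): expands compatibility
    // characters, ligatures included, by applying NFKC normalization.
    static String expand(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB00 (ff), U+FB01 (fi) and U+FB03 (ffi) are each a single code point.
        String input = "o\uFB00ice \uFB01le a\uFB03x";
        System.out.println(expand(input));     // prints "office file affix"

        // NFKC also touches non-ligature compatibility characters:
        // SUPERSCRIPT TWO (U+00B2) becomes the plain digit 2.
        System.out.println(expand("x\u00B2")); // prints "x2"
    }
}
```

If only ligatures should be expanded and other compatibility characters left alone, one option is to normalize just the characters in the Alphabetic Presentation Forms block (U+FB00 to U+FB4F) rather than the whole string.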