Separating Unicode ligature characters

Updated: 2024-10-12 16:21:51
Question

Among the vast number of Unicode characters, some actually represent more than one character, such as the ligature ﬀ (U+FB00), which stands for two 'f' characters. Is there an easy way to convert characters like these into multiple single characters? Preferably something available in the standard Java API, but I can use an external library if need be.

Recommended answer

U+FB00 is a compatibility character. Normally Unicode doesn't provide separate code points for ligatures, on the argument that whether and when to use a ligature is a layout decision and should not influence how the data is stored. A few of them still exist to allow round-trip conversion with older encodings that do represent ligatures as separate entities.

Luckily, the information about which characters a ligature represents is present in the Unicode data files, and most capable string-handling systems have that data built in.
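As a rough illustration of that built-in data, the following sketch lists the Latin ligatures in the range U+FB00 to U+FB06 together with their Unicode names and compatibility expansions. The class name `LigatureTable` and the helper `expansionOf` are made up for this example; only `Character.getName`, `Character.toChars` and `Normalizer.normalize` come from the standard Java API.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of the original answer.
final class LigatureTable {

    // Returns the NFKC (compatibility) expansion of a single code point.
    static String expansionOf(int codePoint) {
        String single = new String(Character.toChars(codePoint));
        return Normalizer.normalize(single, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB00..U+FB06 are the Latin ligatures (ff, fi, fl, ffi, ffl, long s t, st).
        for (int cp = 0xFB00; cp <= 0xFB06; cp++) {
            System.out.printf("U+%04X %s -> %s%n",
                    cp, Character.getName(cp), expansionOf(cp));
        }
    }
}
```

The names and expansions come from the Unicode data bundled with the JDK, so no external library is needed for this lookup.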

In Java, you'll need to use the Normalizer class and the NFKC form:

import java.text.Normalizer;
import java.text.Normalizer.Form;

String ff = "\uFB00";
String normalized = Normalizer.normalize(ff, Form.NFKC);
System.out.println(ff + " = " + normalized);

This prints:

ﬀ = ff
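One thing to keep in mind is that NFKC rewrites every compatibility character, not only ligatures; a superscript two (U+00B2), for instance, becomes a plain "2". The sketch below contrasts normalizing a whole string with expanding only the code points U+FB00 to U+FB06. The class and method names are hypothetical, and restricting the expansion to that range is an assumption made for this example rather than something the answer prescribes.

```java
import java.text.Normalizer;

// Hypothetical helper class, not part of the original answer.
final class LigatureSplitter {

    // Expands every compatibility character (ligatures included) via NFKC.
    static String expandAll(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Expands only the Latin ligatures U+FB00..U+FB06 and copies
    // every other code point through unchanged.
    static String expandLatinLigatures(String text) {
        StringBuilder out = new StringBuilder(text.length() + 4);
        text.codePoints().forEach(cp -> {
            if (cp >= 0xFB00 && cp <= 0xFB06) {
                String single = new String(Character.toChars(cp));
                out.append(Normalizer.normalize(single, Normalizer.Form.NFKC));
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "o\uFB03ce x\u00B2"; // "office" with the ffi ligature, then x squared
        System.out.println(expandAll(sample));            // office x2
        System.out.println(expandLatinLigatures(sample)); // office x²
    }
}
```

Which variant fits depends on whether the other compatibility mappings are wanted; for search or comparison, full NFKC is often acceptable, while text that must otherwise stay intact may call for the narrower version.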

Published: 2023-11-29 13:42:42
Article link: https://www.elefans.com/category/jswz/34/1646545.html