Unicode normalization (form C) in R: convert all characters with accents into their single-Unicode-character form?

Updated: 2024-10-27 21:12:30

Problem description

In Unicode, letters with accents can be represented in two ways: as the accented letter itself (the precomposed form), or as the bare letter followed by a combining accent. For example, é (U+00E9) and é (U+0065 U+0301) are usually displayed in the same way.

R renders the following (version 3.0.2, Mac OS 10.7.5):

> "\u00e9"
[1] "é"
> "\u0065\u0301"
[1] "é"

However, of course:

> "\u00e9" == "\u0065\u0301"
[1] FALSE
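
The comparison fails because the two strings hold different code point sequences, even though they render identically. A minimal base R sketch that makes this visible (the variable names are illustrative, not from the original post):

```r
# Base R only; variable names are illustrative.
composed   <- "\u00e9"        # single code point U+00E9
decomposed <- "\u0065\u0301"  # "e" followed by the combining acute accent U+0301

utf8ToInt(composed)    # 233
utf8ToInt(decomposed)  # 101 769

# nchar() counts code points by default (type = "chars"),
# so the two visually identical strings have different lengths.
nchar(composed)        # 1
nchar(decomposed)      # 2
```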

Is there a function in R that converts such two-code-point letters into their single-character form? In particular, here it would collapse "\u0065\u0301" into "\u00e9".

That would be extremely handy for processing large quantities of strings. Plus, the single-character forms can easily be converted to other encodings via iconv (at least for the usual Latin1 characters) and are better handled by plot.
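
To illustrate the iconv point, here is a small sketch using base R only. It assumes the target encoding name "latin1" is accepted by the local iconv implementation; how the decomposed form is treated can vary by platform, so no output is claimed for it.

```r
# Base R only; variable names are illustrative.
composed   <- "\u00e9"        # precomposed form, exists in Latin-1
decomposed <- "\u0065\u0301"  # U+0301 has no Latin-1 equivalent

# The precomposed form converts to a single Latin-1 byte.
latin1_ok <- iconv(composed, from = "UTF-8", to = "latin1")
charToRaw(latin1_ok)

# The decomposed form contains a combining character that Latin-1
# cannot represent; iconv() commonly returns NA in that case
# (platform-dependent, so treat this as an assumption).
latin1_bad <- iconv(decomposed, from = "UTF-8", to = "latin1")
is.na(latin1_bad)
```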

Thanks a lot in advance.

Solution

Ok, it appears that a package has been developed to enhance and simplify the string manipulation toolbox in R (finally!). It is called stringi and looks very promising. Its documentation is very well written, and in particular I find the pages about encodings and locales much more enlightening than some of the standard R documentation on the subject.

It has Unicode normalization functions, as I was looking for (here form C):

> library(stringi)
> stri_trans_nfc('\u00e9') == stri_trans_nfc('\u0065\u0301')
[1] TRUE
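
Since the question was about processing many strings at once, here is a hedged sketch of the same function applied to a character vector. It assumes the stringi package is installed; the sample strings are made up for illustration.

```r
library(stringi)

# Illustrative input: a mix of decomposed, precomposed and plain ASCII text.
x <- c("caf\u0065\u0301", "cr\u00e8me", "plain ascii")

# stri_trans_isnfc() reports, per element, whether it is already in form C.
stri_trans_isnfc(x)

# stri_trans_nfc() is vectorised and normalises every element to form C.
x_nfc <- stri_trans_nfc(x)

# After normalisation the first element uses the precomposed U+00E9.
x_nfc[1] == "caf\u00e9"
nchar(x)      # code point counts before normalisation
nchar(x_nfc)  # code point counts after normalisation
```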

It also contains a smart comparison function which integrates these normalization questions and lessens the pain of having to think about them:

> stri_compare('\u00e9', '\u0065\u0301')
[1] 0
# i.e. equal;
# otherwise it returns 1 or -1, i.e. greater or lesser, in alphabetical order.
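
For a plain yes/no answer rather than an ordering, stringi also offers stri_cmp_equiv(), which tests canonical equivalence. A minimal sketch, assuming stringi is installed (variable names are illustrative):

```r
library(stringi)

a <- "\u00e9"        # precomposed
b <- "\u0065\u0301"  # decomposed

# Logical test for canonical equivalence, without normalising by hand.
stri_cmp_equiv(a, b)

# Three-way comparison, as in the answer above: 0 means equal.
stri_compare(a, b)

# Base R's == compares the underlying sequences and sees two different strings.
a == b
```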

Thanks to the developers, Marek Gągolewski and Bartek Tartanus, and to Kurt Hornik for the info!

Published: 2023-03-31 01:52:48
Link: https://www.elefans.com/category/jswz/34/793608.html