Updated: 2024-10-23 12:24:41
How to split a file containing non-ASCII characters into words, in bash?

For example, I have a file with normal text, like:

"Word1 Kuͦn, buͤtten; word4:"

I want to get a file with 1 word per line, keeping the punctuation, and sorted:

,
:
;
Word1
Kuͦn
buͤtten
word4

The code I use:

grep -Eo '\w+|[^\w ]' input.txt | sort -f >> output.txt

This code works almost perfectly, except for one thing: it splits diacritical characters apart from the letters they belong to, as if they were separate words:

,
:
;
Word1
Ku
ͦ
n
bu
ͤ
tten
word4

The letters uͦ, uͤ and others with the same diacritics are not in the ASCII table. How can I split my file correctly without deleting or replacing these characters?

Edit:

locale output:

LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

Best answer

Unfortunately, U+0366 (COMBINING LATIN SMALL LETTER O) is not an alphabetic character. It is a non-spacing mark, Unicode category Mn, which generally maps to the POSIX ctype cntrl.
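The category claim is easy to check. The following is a minimal Python sketch, used here only for illustration and not part of the original answer. It looks up U+0366 with the standard unicodedata module and shows that a plain \w+ pattern splits a word at the mark, much like the grep output in the question. The helper names describe and naive_words are made up for this example.

```python
import re
import unicodedata

MARK = "\u0366"  # the combining character used in "Kuͦn"


def describe(ch):
    """Return (Unicode name, general category) for a single character."""
    return unicodedata.name(ch), unicodedata.category(ch)


def naive_words(text):
    """Split using a plain word-character pattern, ignoring combining marks."""
    return re.findall(r"\w+", text)


if __name__ == "__main__":
    print(describe(MARK))
    print(naive_words("Ku" + MARK + "n"))
```

On CPython the mark is reported with category Mn, and the word comes back as the two pieces Ku and n, because the mark itself is not a word character.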

Roughly speaking, an alphabetic grapheme is an alphabetic character possibly followed by one or more combining characters. It's possible to write that as a regex pattern if you have a regex library which implements Unicode general categories. GNU grep is usually compiled with an interface to the popular pcre (Perl-compatible regular expression) library, which has reasonably good Unicode support. So if you have GNU grep, you're in luck.

To enable "perl-like" regular expressions, you need to invoke grep with the -P option (the pcre library also ships its own standalone tool, pcregrep). However, that may not be quite enough: depending on the grep version, -P can treat the input as 8-bit data even if the locale specifies a UTF-8 encoding. In that case you need to put the regex system into "UTF-8" mode in order to get it to recognize your character encoding.
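Why the encoding mode matters can be seen from the bytes themselves. A small Python illustration (an assumption-free check, but not from the original answer; the helper names are invented here): the combining mark is one code point but two bytes in UTF-8, so a matcher that works on 8-bit units never sees U+0366 as a single character.

```python
WORD = "Ku\u0366n"  # "Kuͦn", with the combining mark written as an escape


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in text]


def utf8_bytes(text):
    """Return the UTF-8 encoding of text as a list of hex byte strings."""
    return ["%02X" % b for b in text.encode("utf-8")]


if __name__ == "__main__":
    print(code_points(WORD))  # four code points
    print(utf8_bytes(WORD))   # five bytes: the mark occupies two of them
```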

Putting all that together, you might end up with something like the following:

grep -Po '(*UTF8)(\p{L}\p{M}*|\p{N})+|[\p{P}\p{S}]'

-P                   patterns are "perl-compatible"
-o                   output each substring matched
(*UTF8)              If the pattern starts with exactly this sequence, pcre is put into UTF-8 mode.
\p{...}              Select a character in a specified Unicode general category
\P{...}              Select a character not in a specified Unicode general category
\p{L}                General category L: letters
\p{N}                General category N: numbers
\p{M}                General category M: combining marks
\p{P}                General category P: punctuation
\p{S}                General category S: symbols
\p{L}\p{M}*          A letter possibly followed by various combining marks
\p{L}\p{M}*|\p{N}    ... or a number

More information on Unicode general categories and Unicode regular expression matching in general can be found in Unicode Technical Standard #18 (formerly Technical Report 18) on regular expression matching. But beware that the syntax described there is a recommendation and is not exactly implemented by most regex libraries. In particular, pcre does not support the useful notation \p{L|N} (letter or number). Instead, you need to use [\p{L}\p{N}].

Documentation about pcre is probably available on your system (man pcre); if not, it can be found online.

If you don't have GNU grep, or in the unlikely case that your version was compiled without pcre support, you might be able to use perl, python or other languages with regex capabilities. However, doing so is surprisingly difficult. After some experimentation, I found the following Perl incantation which seems to work:

perl -CIO -lne 'print $& while /(\p{L}\p{M}*|\p{N})+|[\p{P}\p{S}]/g'

Here, -CIO tells Perl that input and output are in UTF-8, and -lne (also commonly written -nle) is a standard incantation: -l automatically outputs a newline after each print, -n loops through every line of the input, and -e executes the following code in the loop.
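For Python, one possible approach is sketched below. It is not from the original answer. As far as I know the standard re module does not understand \p{...}, so this sketch avoids regular expressions and walks the text using unicodedata general categories instead. It is meant to mirror the pattern (\p{L}\p{M}*|\p{N})+|[\p{P}\p{S}]; the names major, tokenize and SAMPLE are hypothetical.

```python
import unicodedata

# Sample line from the question, with the combining marks written as escapes.
SAMPLE = "Word1 Ku\u0366n, bu\u0364tten; word4:"


def major(ch):
    """Return the major general category of ch: L, M, N, P, S, Z or C."""
    return unicodedata.category(ch)[0]


def tokenize(text):
    """Split text into words (letters with their marks, digits) and
    single punctuation or symbol characters; everything else is skipped."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        cat = major(text[i])
        if cat in ("L", "N"):
            start = i
            # A word is a run of numbers and letters, where each letter
            # may be followed by any number of combining marks.
            while i < n and major(text[i]) in ("L", "N"):
                is_letter = major(text[i]) == "L"
                i += 1
                if is_letter:
                    while i < n and major(text[i]) == "M":
                        i += 1
            tokens.append(text[start:i])
        elif cat in ("P", "S"):
            tokens.append(text[i])  # one token per punctuation/symbol
            i += 1
        else:
            i += 1  # whitespace, control characters, stray marks
    return tokens


if __name__ == "__main__":
    for token in tokenize(SAMPLE):
        print(token)
```

The tokens come out in input order, one per line; sorting is left to a later step, for example by piping the output through sort -f as in the question.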

Published: 2023-07-15 18:58:00
Article link: https://www.elefans.com/category/jswz/34/1117526.html