How do I remove punctuation?

Programming Basics | Industry News — Updated: 2024-10-15 14:16:51
Problem Description

I am using a tokenizer from NLTK in Python.

There are whole bunch of answers for removing punctuations on the forum already. However, none of them address all of the following issues together:

  • More than one symbol in a row. For example, the sentence: He said,"that's it." Because there's a comma followed by quotation mark, the tokenizer won't remove ." in the sentence. The tokenizer will give ['He', 'said', ',"', 'that', 's', 'it.'] instead of ['He','said', 'that', 's', 'it']. Some other examples include '...', '--', '!?', ',"', and so on.
  • Remove symbol at the end of the sentence. i.e. the sentence: Hello World. The tokenizer will give ['Hello', 'World.'] instead of ['Hello', 'World']. Notice the period at the end of the word 'World'. Some other examples include '--',',' in the beginning, middle, or end of any character.
  • Remove characters with symbols in front and after. i.e. '*u*', '''','""'
    Is there an elegant way of solving both problems?

    Recommended Answer

    If you want to tokenize your string all in one shot, I think your only choice will be to use nltk.tokenize.RegexpTokenizer. The following approach will allow you to use punctuation as a marker to remove characters of the alphabet (as noted in your third requirement) before removing the punctuation altogether. In other words, this approach will remove *u* before stripping all punctuation.

    One way to go about this, then, is to tokenize on gaps like so:

    >>> from nltk.tokenize import RegexpTokenizer
    >>> s = '''He said,"that's it." *u* Hello, World.'''
    >>> toker = RegexpTokenizer(r'((?<=[^\w\s])\w(?=[^\w\s])|(\W))+', gaps=True)
    >>> toker.tokenize(s)
    ['He', 'said', 'that', 's', 'it', 'Hello', 'World']  # omits *u* per your third requirement
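    With gaps=True, RegexpTokenizer essentially splits the input on every match of the pattern and discards empty tokens. For readers without NLTK installed, roughly the same behavior can be reproduced with the standard library's re module; in this sketch the pattern's groups are rewritten as non-capturing so that re.split does not return the separators:

```python
import re

# The answer's pattern, with its groups made non-capturing so that
# re.split() returns only the text between matches.
GAPS = r'(?:(?<=[^\w\s])\w(?=[^\w\s])|\W)+'

def tokenize(text):
    # Every match of GAPS is treated as a gap between tokens;
    # keep only the non-empty pieces, as gaps=True does in NLTK.
    return [tok for tok in re.split(GAPS, text) if tok]

print(tokenize('''He said,"that's it." *u* Hello, World.'''))
# -> ['He', 'said', 'that', 's', 'it', 'Hello', 'World']
```

    This is an illustration of what the pattern does, not NLTK's exact internals.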

    This should meet all three of the criteria you specified above. Note, however, that this tokenizer will not return tokens such as "A". Furthermore, I only tokenize on single letters that begin and end with punctuation. Otherwise, "Go." would not return a token. You may need to nuance the regex in other ways, depending on what your data looks like and what your expectations are.
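    If the one-shot requirement is relaxed, another option (a sketch of my own, not part of the original answer) is to split on whitespace first and clean each token with plain re: drop tokens like *u* that are a single character wrapped in symbols, then treat any remaining run of punctuation as a separator:

```python
import re

def tokenize_no_punct(text):
    tokens = []
    for raw in text.split():
        # Third requirement: drop a lone character wrapped in symbols, e.g. '*u*'.
        if re.fullmatch(r'\W+\w\W+', raw):
            continue
        # First and second requirements: runs of punctuation anywhere in the
        # token act as separators; empty pieces are discarded.
        tokens.extend(p for p in re.split(r'\W+', raw) if p)
    return tokens

print(tokenize_no_punct('''He said,"that's it." *u* Hello, World.'''))
# -> ['He', 'said', 'that', 's', 'it', 'Hello', 'World']
```

    Like the RegexpTokenizer pattern, this only drops single letters sandwiched in punctuation, so adjust the fullmatch check if your data needs something broader.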

    Published: 2023-10-08 15:14:00
    Original link: https://www.elefans.com/category/jswz/34/1473000.html