Is it possible to tokenize a text such that the first and last name are combined into one token? For example, if my text is:
text = "Barack Obama is the President"
then:
text.split()
results in:
['Barack', 'Obama', 'is', 'the', 'President']
How can I recognize the first and last name, so that I get only ['Barack Obama', 'is', 'the', 'President'] as tokens?
Is there a way to achieve it in Python?
Recommended answer
What you are looking for is a named entity recognition (NER) system. I suggest you do not treat this as part of tokenization: tokenize first, then let the NER step tell you which tokens belong together.
For Python you can use pypi.python/pypi/ner/
Example from the site:
>>> tagger.json_entities("Alice went to the Museum of Natural History.")
'{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'
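Once a tagger has given you the entity strings, getting the token list from the question is a small post-processing step. Below is a minimal sketch using only the standard library. The function name tokenize_with_entities is my own (hypothetical, not part of the ner package), and it assumes the entities appear in the text exactly as the tagger reported them.

```python
import json
import re


def tokenize_with_entities(text, entities):
    """Split text on whitespace, keeping each known entity as a single token.

    `entities` is assumed to be a list of strings produced by some NER
    tagger; this helper is illustrative and not part of any library.
    """
    if not entities:
        return text.split()
    # Try longer entities first so "Barack Obama" wins over "Barack".
    ordered = sorted(set(entities), key=len, reverse=True)
    alternatives = "|".join(re.escape(e) for e in ordered)
    # Match a whole entity (not followed by a word character),
    # otherwise fall back to a run of non-whitespace characters.
    pattern = rf"(?:{alternatives})(?!\w)|\S+"
    return re.findall(pattern, text)


print(tokenize_with_entities("Barack Obama is the President", ["Barack Obama"]))
# ['Barack Obama', 'is', 'the', 'President']

# Feeding in the JSON string shown in the example above:
tagged = json.loads(
    '{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'
)
names = [name for group in tagged.values() for name in group]
print(tokenize_with_entities("Alice went to the Museum of Natural History.", names))
# ['Alice', 'went', 'to', 'the', 'Museum of Natural History', '.']
```

Note that punctuation directly after an entity comes out as its own token, while punctuation attached to an ordinary word stays attached, because the fallback is a plain whitespace split. If you need proper tokenization for the non-entity words, swap the fallback for a real tokenizer.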