How can spaCy tokenize a hashtag as a single token?


Problem description



In a sentence containing hashtags, such as a tweet, spaCy's tokenizer splits each hashtag into two tokens:

import spacy
nlp = spacy.load('en')
doc = nlp(u'This is a #sentence.')
[t for t in doc]

Output:

[This, is, a, #, sentence, .]

I'd like hashtags to be tokenized as follows. Is that possible?

[This, is, a, #sentence, .]

Solution

You can do some pre- and post-processing of the string, which lets you bypass the '#'-based tokenization and is easy to implement. For example:

>>> import re
>>> import spacy
>>> nlp = spacy.load('en')
>>> sentence = u'This is my twitter update #MyTopic'
>>> parsed = nlp(sentence)
>>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'#', u'MyTopic']
>>> new_sentence = re.sub(r'#(\w+)', r'ZZZPLACEHOLDERZZZ\1', sentence)
>>> new_sentence
u'This is my twitter update ZZZPLACEHOLDERZZZMyTopic'
>>> parsed = nlp(new_sentence)
>>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'ZZZPLACEHOLDERZZZMyTopic']
>>> [x.replace(u'ZZZPLACEHOLDERZZZ', '#') for x in [token.text for token in parsed]]
[u'This', u'is', u'my', u'twitter', u'update', u'#MyTopic']

You can try setting custom separators in spaCy's tokenizer, though I am not aware of a built-in method for doing that.
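
One possible way is to rebuild the tokenizer's prefix rules without the '#' character. A minimal sketch, assuming spaCy 2.x or later (where the 'en' shortcut gives way to a model name such as en_core_web_sm) and that '#' appears as a literal entry in nlp.Defaults.prefixes:

import spacy
from spacy.util import compile_prefix_regex

# Assumes a modern English model name; adjust to whatever model is installed.
nlp = spacy.load('en_core_web_sm')

# Rebuild the prefix regex without '#' so the tokenizer no longer splits it
# off the front of a token. Assumes '#' is a literal entry in the defaults.
prefixes = [p for p in nlp.Defaults.prefixes if p != '#']
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

doc = nlp(u'This is a #sentence.')
print([t.text for t in doc])
# Expected: ['This', 'is', 'a', '#sentence', '.']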

UPDATE: You can use a regex to find the spans of the tokens you want to keep as single tokens, and re-tokenize them using the span.merge method documented here: https://spacy.io/docs/api/span#merge

Merge example:

>>> import spacy
>>> import re
>>> nlp = spacy.load('en')
>>> my_str = u'Tweet hashtags #MyHashOne #MyHashTwo'
>>> parsed = nlp(my_str)
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#', u'NOUN'), (u'MyHashOne', u'NOUN'), (u'#', u'NOUN'), (u'MyHashTwo', u'PROPN')]
>>> indexes = [m.span() for m in re.finditer('#\w+',my_str,flags=re.IGNORECASE)]
>>> indexes
[(15, 25), (26, 36)]
>>> for start,end in indexes:
...     parsed.merge(start_idx=start,end_idx=end)
... 
#MyHashOne
#MyHashTwo
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#MyHashOne', u'NOUN'), (u'#MyHashTwo', u'PROPN')]
>>> 
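
Note that Span.merge was removed in spaCy 3.x; the equivalent in newer releases is the Doc.retokenize context manager. A minimal sketch of the same idea, assuming spaCy 2.1 or later and a modern English model name:

import re
import spacy

nlp = spacy.load('en_core_web_sm')  # modern equivalent of the old 'en' shortcut
my_str = u'Tweet hashtags #MyHashOne #MyHashTwo'
doc = nlp(my_str)

# Find the character spans of the hashtags and merge each into a single token.
with doc.retokenize() as retokenizer:
    for m in re.finditer(r'#\w+', my_str):
        span = doc.char_span(m.start(), m.end())
        if span is not None:          # char_span returns None if the offsets
            retokenizer.merge(span)   # do not align with token boundaries

print([(t.text, t.pos_) for t in doc])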
