I have created a spaCy PhraseMatcher to match names in a document, following the tutorial. I want to use the resulting matches as additional training data for a spaCy NER model. My patterns, however, contain both full names (e.g. 'Barack Obama') and last names ('Obama') as separate entries.
Hence, in a sentence that contains 'Barack Obama', both patterns match, which produces overlapping matches. This overlap triggers an exception when I try to use the data for training, e.g.:
ValueError: [E103] Trying to set conflicting doc.ents: '(19, 33, 'PERSON')' and '(29, 33, 'PERSON')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
I've been considering filtering out overlapping matches before using the data for training, but this seems like a very inefficient approach that would significantly increase processing time on large datasets.
Is there a way to set up a PhraseMatcher so that, where matches overlap, it returns only the longest one?
Recommended answer
The PhraseMatcher doesn't have a built-in way to filter out overlapping matches while it's matching, but there is a utility function that filters them afterwards: spacy.util.filter_spans(). It prefers the longest span and, if two overlapping spans are the same length, the one that occurs earlier in the text.
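As a rough illustration of the approach above, here is a minimal sketch. It assumes spaCy v3 (where PhraseMatcher.add takes a list of pattern Docs) and uses a blank English pipeline, so no trained model is needed. The example sentence and the names list are made up for the demonstration and are not from the question.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumption: a blank pipeline (tokenizer only) is enough for phrase matching.
nlp = spacy.blank("en")

matcher = PhraseMatcher(nlp.vocab)
names = ["Barack Obama", "Obama"]  # illustrative patterns: full name and last name
matcher.add("PERSON", [nlp.make_doc(name) for name in names])

doc = nlp.make_doc("Yesterday Barack Obama gave a speech.")

# Both patterns match here, so the raw matches overlap on the token "Obama".
spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matcher(doc)]

# Keep only the longest span from each group of overlapping spans.
filtered = filter_spans(spans)

# The filtered spans no longer overlap, so this assignment does not raise E103.
doc.ents = filtered
entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
print(entities)
```

The resulting character offsets and labels can then be written out in whatever training format you use. The filtering runs once per document over that document's matches only, so it is worth measuring before assuming it will dominate processing time.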