Longest match only with spaCy PhraseMatcher

Updated: 2024-10-17 02:58:23

Question

I have created a spaCy PhraseMatcher to match names in a document, following the tutorial. I want to use the resulting matches as additional training data for a spaCy NER model. My patterns, however, contain both full names (e.g. 'Barack Obama') and last names ('Obama') as separate entries.

Hence, in a sentence that contains 'Barack Obama', both patterns match, which produces overlapping matches. This overlap triggers an exception when I try to use the data for training, e.g.:

ValueError: [E103] Trying to set conflicting doc.ents: '(19, 33, 'PERSON')' and '(29, 33, 'PERSON')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
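The overlap can be reproduced with a few lines. This is only a minimal sketch, assuming spaCy v3 (where `PhraseMatcher.add` takes a list of pattern docs) and a blank English pipeline; the example sentence and the label name "PERSON" are illustrative choices, not taken from my actual data.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank pipeline is enough here: only the tokenizer is needed.
nlp = spacy.blank("en")

# Register the full name and the last name as separate patterns.
matcher = PhraseMatcher(nlp.vocab)
matcher.add("PERSON", [nlp.make_doc(name) for name in ("Barack Obama", "Obama")])

doc = nlp.make_doc("Yesterday Barack Obama gave a speech.")

# Each match is (match_id, start_token, end_token); sort for a stable order.
overlapping = sorted((start, end) for _, start, end in matcher(doc))
for start, end in overlapping:
    print(start, end, doc[start:end].text)
```

Both patterns match here, and the token "Obama" is covered by two spans, which is the kind of conflict the error above complains about when the spans are assigned to doc.ents.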

I have considered filtering out the overlapping matches before using the data for training, but this seems like a very inefficient approach that would significantly increase processing time on large data.

Is there a way to set up a PhraseMatcher so that, when matches overlap, it returns only the longest one?

Recommended answer

The PhraseMatcher has no built-in way to filter out overlapping matches while it is matching, but there is a utility function for filtering them afterwards: spacy.util.filter_spans(). It prefers the longest span; if two overlapping spans have the same length, it keeps the one that occurs earlier in the text.
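As a rough illustration, the filtering step might look like the sketch below. It assumes spaCy v3, where the matcher can be called with as_spans=True to return Span objects directly; the sentence, the patterns and the "PERSON" label are made up for the example.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.blank("en")

matcher = PhraseMatcher(nlp.vocab)
matcher.add("PERSON", [nlp.make_doc(name) for name in ("Barack Obama", "Obama")])

doc = nlp.make_doc("Barack Obama met Michelle Obama.")

# as_spans=True returns Span objects labelled with the match key ("PERSON").
spans = matcher(doc, as_spans=True)

# Keep only the longest span among overlapping ones.
longest = filter_spans(spans)

# The remaining spans no longer overlap, so they can be set as entities.
doc.ents = longest
for ent in doc.ents:
    print(ent.start, ent.end, ent.label_, ent.text)
```

In this example the standalone "Obama" inside "Barack Obama" is dropped, while the "Obama" in "Michelle Obama" is kept because nothing longer overlaps it. On older spaCy versions without as_spans, the same spans could be built by hand from the (match_id, start, end) tuples before calling filter_spans.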

Published: 2023-10-26 18:33:44
Source: https://www.elefans.com/category/jswz/34/1531003.html