Longest match only with spaCy PhraseMatcher

Updated: 2024-10-17 02:58:23

Problem description

I have created a spaCy PhraseMatcher to match names in a document, following the tutorial. I want to use the resulting matches as additional training data for a spaCy NER model. My patterns, however, contain both full names (e.g. 'Barack Obama') and last names ('Obama') as separate entries.

Hence, in a sentence that contains 'Barack Obama', both patterns match, resulting in overlapping matches. This overlap, however, triggers an exception when I try to use the data for training, e.g.:

ValueError: [E103] Trying to set conflicting doc.ents: '(19, 33, 'PERSON')' and '(29, 33, 'PERSON')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
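
The overlap described above can be reproduced with a short sketch. It assumes spaCy v3 and a blank English pipeline (tokenizer only, no trained model); the sample sentence and the "PERSON" match key are made up for illustration and are not taken from the original data.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Blank pipeline: only the tokenizer is needed for phrase matching.
nlp = spacy.blank("en")

matcher = PhraseMatcher(nlp.vocab)
# One pattern for the full name and one for the last name alone.
matcher.add("PERSON", [nlp.make_doc("Barack Obama"), nlp.make_doc("Obama")])

# Illustrative sentence (an assumption, not from the original post).
doc = nlp.make_doc("Yesterday Barack Obama gave a speech.")

# Each match is a (match_id, start_token, end_token) tuple.
spans = [(start, end, doc[start:end].text) for _, start, end in matcher(doc)]
print(spans)
```

Both patterns fire on the same mention, so the token "Obama" ends up inside two matches. Assigning both to doc.ents is what raises the conflict error.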

I've been considering filtering out overlapping matches before using the data for training, but this seems like a very inefficient approach that would significantly increase processing time on large data.

Is there a way to set up a PhraseMatcher so that it only matches the longest match for overlapping matches?

Recommended answer

The PhraseMatcher has no built-in way to filter out overlapping matches while it is matching, but there is a utility function that filters them afterwards: spacy.util.filter_spans(). It prefers the longest span; if two overlapping spans have the same length, it keeps the one that occurs earlier in the text.
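
A minimal sketch of that post-filtering step, assuming spaCy v3 and a blank English pipeline. The sentence, the "PERSON" label, and the variable names are illustrative assumptions rather than part of the original answer.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")  # tokenizer only; no trained model required

matcher = PhraseMatcher(nlp.vocab)
matcher.add("PERSON", [nlp.make_doc("Barack Obama"), nlp.make_doc("Obama")])

# Illustrative sentence (an assumption): one full-name mention, one last-name mention.
doc = nlp.make_doc("Barack Obama spoke and then Obama left.")

# Turn the raw (match_id, start, end) tuples into labelled Span objects.
spans = [Span(doc, start, end, label="PERSON") for _, start, end in matcher(doc)]

# Drop overlapping spans, keeping the longest one in each overlapping group.
filtered = filter_spans(spans)

# The filtered spans no longer overlap, so they can be assigned as entities.
doc.ents = filtered
result = [(ent.text, ent.start, ent.end) for ent in doc.ents]
print(result)
```

Here the standalone "Obama" inside "Barack Obama" is discarded, while the later, non-overlapping "Obama" is kept. On the efficiency concern, filter_spans sorts the spans and then makes a single pass over them, so its cost is likely small next to the matching itself, though that is worth measuring on the actual data.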

Published: 2023-10-26 18:33:44

Article link: https://www.elefans.com/category/jswz/34/1531003.html