Entity extraction/recognition with free tools while feeding a Lucene index

Updated: 2024-10-10 13:18:44

Problem description

I'm currently investigating options for extracting person names, locations, technical terms, and categories from text (a large number of web articles), which will then be fed into a Lucene/ElasticSearch index. The extracted information is added as metadata and should increase the precision of the search.

For example, when someone queries 'wicket', they should be able to decide whether they mean the sport of cricket or the Apache project. I tried to implement this on my own, with only minor success so far. I have since found a lot of tools, but I'm not sure whether they are suited to this task, which of them integrate well with Lucene, or whether their entity-extraction precision is high enough.
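
As a rough illustration of the behaviour described above, here is a minimal sketch in plain Python (no Lucene involved). It assumes that an upstream extraction step has already attached a `category` field to each document; the corpus, the field name, and the helper functions are all hypothetical. A query for 'wicket' returns per-category counts, and the user can then restrict the search to one meaning.

```python
from collections import Counter, defaultdict

# Toy corpus: each document carries a "category" metadata field that an
# entity-extraction step is assumed to have produced upstream.
DOCS = [
    {"id": 1, "text": "The bowler hit the wicket in the final over", "category": "cricket"},
    {"id": 2, "text": "Apache Wicket is a component-based web framework", "category": "software"},
    {"id": 3, "text": "A sticky wicket slowed the batting side", "category": "cricket"},
]


def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc in docs:
        for token in doc["text"].lower().split():
            index[token].add(doc["id"])
    return index


def search(index, docs, term, category=None):
    """Return matching docs, optionally restricted to one category."""
    by_id = {doc["id"]: doc for doc in docs}
    hits = [by_id[i] for i in sorted(index.get(term.lower(), ()))]
    if category is not None:
        hits = [d for d in hits if d["category"] == category]
    return hits


def facets(hits):
    """Count hits per category so the user can pick the intended meaning."""
    return Counter(d["category"] for d in hits)


index = build_index(DOCS)
all_hits = search(index, DOCS, "wicket")
print(facets(all_hits))
print([d["id"] for d in search(index, DOCS, "wicket", category="software")])
```

In a real deployment the category counts would come from the search engine's own faceting or aggregation feature rather than from a hand-rolled index like this one.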

  • DBpedia Spotlight: the demo looks very promising
  • OpenNLP: requires training. Which training data should be used?
  • OpenNLP tools
  • Stanbol
  • NLTK
  • balie
  • UIMA
  • GATE -> example code
  • Apache Mahout
  • Stanford CRF-NER
  • maui-indexer
  • Mallet
  • Illinois Named Entity Tagger: not open source, but free
  • wikipedianer data

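My questions:

  • Does anyone have experience with some of the tools listed above and their precision/recall? Or with whether training data is required and available?
  • Are there articles or tutorials that help me get started with entity extraction (NER) for each tool?
  • How can they be integrated with Lucene?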
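
On the integration question, one tool-independent route is to run extraction before indexing and store the results as extra fields on each document. The sketch below assumes the entities have already been extracted by whichever tool is chosen, and serializes documents in the newline-delimited JSON format that the Elasticsearch bulk API accepts. The field names (`persons`, `locations`, `tech_terms`, `category`) and the input layout are hypothetical.

```python
import json


def to_bulk_lines(index_name, docs):
    """Serialize docs as bulk-API NDJSON: one action line, then one source line."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index_name, "_id": str(doc["id"])}}
        source = {
            "body": doc["text"],
            # Hypothetical metadata fields filled from the extraction step.
            "persons": doc["entities"].get("person", []),
            "locations": doc["entities"].get("location", []),
            "tech_terms": doc["entities"].get("tech", []),
            "category": doc.get("category"),
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk API expects the payload to end with a newline.
    return "\n".join(lines) + "\n"


articles = [
    {
        "id": 1,
        "text": "Apache Wicket is a Java web framework.",
        "entities": {"tech": ["Apache Wicket", "Java"]},
        "category": "software",
    }
]
payload = to_bulk_lines("articles", articles)
print(payload)
```

The resulting string would be sent to the `_bulk` endpoint with the `application/x-ndjson` content type. With plain Lucene, the same idea applies: each metadata value becomes an additional field on the document at index time.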

Here are some questions related to that subject:

  • Does an algorithm exist to help detect the "primary topic" of an English sentence?
  • Named Entity Recognition Libraries for Java
  • Named entity recognition with Java

Recommended answer

The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both would fall outside the typically recognized types: person, organization, location).

For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to "How to use DBPedia to extract Tags/Keywords from content?", where I provide more explanation and mention several tools for disambiguation, including:

  • Zemanta
  • Maui-indexer
  • DBpedia Spotlight
  • Extractiv (my company)
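
To make the idea of disambiguating against a knowledge base concrete, here is a deliberately tiny sketch. It is not how any of the tools above work internally; it only shows the principle of scoring each candidate meaning by how many of its context words occur in the surrounding text. The miniature knowledge base, its identifiers, and the context words are all made up for illustration.

```python
import re

# Hypothetical miniature knowledge base: each candidate meaning of a surface
# form has an identifier and a bag of context words. A real system would
# derive these from a resource such as DBpedia.
KNOWLEDGE_BASE = {
    "wicket": [
        {"id": "Wicket_(cricket)",
         "context": {"cricket", "bowler", "batsman", "stumps", "innings"}},
        {"id": "Apache_Wicket",
         "context": {"apache", "java", "framework", "web", "component"}},
    ],
}


def tokenize(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(surface_form, text):
    """Pick the candidate whose context words overlap most with the text.

    Returns None when the surface form is unknown or no candidate shares
    any word with the text.
    """
    words = tokenize(text)
    best_id, best_score = None, 0
    for candidate in KNOWLEDGE_BASE.get(surface_form.lower(), []):
        score = len(candidate["context"] & words)
        if score > best_score:
            best_id, best_score = candidate["id"], score
    return best_id


print(disambiguate("wicket", "The bowler knocked over the wicket and the batsman walked off"))
print(disambiguate("wicket", "Wicket is a Java web framework from Apache"))
```

The chosen identifier is what would be stored as metadata in the index, so that a later query can be filtered by meaning rather than by the ambiguous word alone.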

These tools often expose a language-independent API such as REST. I am not aware that any of them provides Lucene support directly, but I hope my answer helps with the problem you are trying to solve.
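
Because these services are reached over HTTP, the glue code on the indexing side can stay small. The following sketch builds a request for a Spotlight-style annotation service and flattens a response into metadata fields. The endpoint URL, the `text` and `confidence` parameters, and the `Resources` / `@URI` / `@surfaceForm` response keys are assumptions modelled on the public DBpedia Spotlight web service and should be checked against its current documentation. The sample payload is hand-written, not real service output.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of a Spotlight-style annotation service.
ENDPOINT = "https://api.dbpedia-spotlight.org/en/annotate"


def build_request(text, confidence=0.5, endpoint=ENDPOINT):
    """Build (but do not send) a GET request asking for JSON annotations."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        f"{endpoint}?{query}", headers={"Accept": "application/json"}
    )


def to_metadata(payload):
    """Flatten an annotation response into index-ready metadata fields."""
    resources = payload.get("Resources", [])
    return {
        "entity_uris": sorted({r["@URI"] for r in resources}),
        "entity_mentions": sorted({r["@surfaceForm"] for r in resources}),
    }


# Hand-written sample in the assumed response shape; not real service output.
sample = json.loads("""
{"Resources": [
  {"@URI": "http://dbpedia.org/resource/Apache_Wicket", "@surfaceForm": "Wicket"},
  {"@URI": "http://dbpedia.org/resource/Java_(programming_language)", "@surfaceForm": "Java"}
]}
""")

request = build_request("Apache Wicket is a Java web framework")
print(request.full_url)
print(to_metadata(sample))
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` and decoding the JSON body. The flattened fields can then be added to each document before it is written to the Lucene or ElasticSearch index.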

Published: 2023-11-23 06:35:23
Link: https://www.elefans.com/category/jswz/34/1620455.html