nltk StanfordNERTagger: How to get proper nouns without capitalization

Updated: 2024-10-25 10:23:23

Problem description

I am trying to use the StanfordNERTagger and nltk to extract keywords from a piece of text.

    import re

    from nltk.corpus import stopwords
    from nltk.tag import StanfordNERTagger, StanfordPOSTagger

    docText = "John Donk works for POI. Brian Jones wants to meet with Xyz Corp. for measuring POI's Short Term performance Metrics."

    words = re.split(r"\W+", docText)

    stops = set(stopwords.words("english"))

    # remove stop words from the list
    words = [w for w in words if w not in stops and len(w) > 2]

    str = " ".join(words)
    print str

    stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
    stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')

    # keep only the words that the POS tagger marks as proper nouns
    stanfordPosTagList = [word for word, pos in stp.tag(str.split()) if pos == 'NNP']

    print "Stanford POS Tagged"
    print stanfordPosTagList

    tagged = stn.tag(stanfordPosTagList)
    print tagged

This gives me:

    John Donk works POI Brian Jones wants meet Xyz Corp measuring POI Short Term performance Metrics
    Stanford POS Tagged
    [u'John', u'Donk', u'POI', u'Brian', u'Jones', u'Xyz', u'Corp', u'POI', u'Short', u'Term']
    [(u'John', u'PERSON'), (u'Donk', u'PERSON'), (u'POI', u'ORGANIZATION'), (u'Brian', u'ORGANIZATION'), (u'Jones', u'ORGANIZATION'), (u'Xyz', u'ORGANIZATION'), (u'Corp', u'ORGANIZATION'), (u'POI', u'O'), (u'Short', u'O'), (u'Term', u'O')]

So clearly, things like Short and Term were tagged as NNP. The data that I have contains many such instances where non-NNP words are capitalized. This might be due to typos, or maybe they are headers. I don't have much control over that.

How can I parse or clean up the data so that I can detect a non-NNP term even though it may be capitalized? I don't want terms like Short and Term to be categorized as NNP.

Also, I am not sure why John Donk was captured as a person but Brian Jones was not. Could it be due to the other capitalized non-NNPs in my data? Could that be affecting how the StanfordNERTagger treats everything else?

Update: a possible solution

Here is what I plan to do:

1. Take each word and convert it to lower case.
2. Tag the lowercase word.
3. If the tag is NNP, then we know that the original word must also be an NNP.
4. If not, then the original word was mis-capitalized.

This is what I am trying to do:

    str = " ".join(words)
    print str
    stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
    for word in str.split():
        wl = word.lower()
        print wl
        w, pos = stp.tag(wl)
        print pos
        if pos == "NNP":
            print "Got NNP"
            print w

But this gives me an error:

    John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics
    john
    Traceback (most recent call last):
      File "X:\crp.py", line 37, in <module>
        w,pos = stp.tag(wl)
    ValueError: too many values to unpack

I have tried multiple approaches, but some error always shows up. How can I tag a single word?

I don't want to convert the whole string to lower case and then tag it. If I do that, the StanfordPOSTagger returns an empty string.
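A note on the `ValueError` in the traceback: NLTK taggers take a list of tokens and return a list of `(word, tag)` pairs. A bare string such as `"john"` is iterated character by character, so several pairs come back and cannot be unpacked into two names, which matches the error shown. The sketch below illustrates this and one way to tag a single word. `StubTagger` is a hypothetical stand-in with an invented tag table, used only so the snippet runs without the Stanford jars, and `tag_single_word` is an assumed helper name, not part of NLTK.

```python
class StubTagger:
    """Hypothetical stand-in for an NLTK tagger such as StanfordPOSTagger.

    Like NLTK taggers, tag() takes a list of tokens and returns a list of
    (token, tag) pairs. The tag table below is made up for illustration.
    """

    _TAGS = {"john": "NN", "works": "VBZ"}

    def tag(self, tokens):
        return [(tok, self._TAGS.get(tok, "NN")) for tok in tokens]


def tag_single_word(tagger, word):
    """Tag one word by wrapping it in a list and unpacking the only pair."""
    [(token, pos)] = tagger.tag([word])
    return token, pos


tagger = StubTagger()

# A bare string is iterated one character at a time, so "john" produces
# four pairs; unpacking that result into two names would raise ValueError.
print(len(tagger.tag("john")))

# Wrapping the word in a list yields exactly one (word, tag) pair.
word, pos = tag_single_word(tagger, "john")
print(word, pos)
```

Keep in mind that tagging words one at a time removes the sentence context a tagger normally relies on, so the tags may differ from those assigned within a full sentence.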

Recommended answer

Firstly, see your other question to set up Stanford CoreNLP to be called from the command line or Python: nltk : How to prevent stemming of proper nouns.

For the properly cased sentence, we see that the NER works properly:

    >>> from corenlp import StanfordCoreNLP
    >>> nlp = StanfordCoreNLP('localhost:9000')
    >>> text = ('John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics. '
    ...         'john donk works poi jones wants meet xyz corp measuring poi short term performance metrics')
    >>> output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,ner', 'outputFormat': 'json'})
    >>> annotated_sent0 = output['sentences'][0]
    >>> annotated_sent1 = output['sentences'][1]
    >>> for token in annotated_sent0['tokens']:
    ...     print token['word'], token['lemma'], token['pos'], token['ner']
    ...
    John John NNP PERSON
    Donk Donk NNP PERSON
    works work VBZ O
    POI POI NNP ORGANIZATION
    Jones Jones NNP ORGANIZATION
    wants want VBZ O
    meet meet VB O
    Xyz Xyz NNP ORGANIZATION
    Corp Corp NNP ORGANIZATION
    measuring measure VBG O
    POI poi NN O
    short short JJ O
    term term NN O
    performance performance NN O
    metrics metric NNS O
    . . . O

And for the lower-cased sentence, you will not get NNP as a POS tag, nor any NER tag:

    >>> for token in annotated_sent1['tokens']:
    ...     print token['word'], token['lemma'], token['pos'], token['ner']
    ...
    john john NN O
    donk donk JJ O
    works work NNS O
    poi poi VBP O
    jones jone NNS O
    wants want VBZ O
    meet meet VB O
    xyz xyz NN O
    corp corp NN O
    measuring measure VBG O
    poi poi NN O
    short short JJ O
    term term NN O
    performance performance NN O
    metrics metric NNS O
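The difference between the two sentences can also be checked programmatically. The sketch below assumes the JSON layout used in the session above (a `tokens` list whose entries carry `word`, `pos` and `ner`). The two sample sentences are excerpts copied by hand from that output rather than a live CoreNLP response, and `proper_nouns` is a hypothetical helper name.

```python
def proper_nouns(sentence):
    """Return (word, ner) pairs for tokens whose POS tag is NNP or NNPS."""
    return [
        (tok["word"], tok["ner"])
        for tok in sentence["tokens"]
        if tok["pos"] in ("NNP", "NNPS")
    ]


# Excerpts copied by hand from the annotated output of the cased sentence.
cased = {"tokens": [
    {"word": "John", "pos": "NNP", "ner": "PERSON"},
    {"word": "Donk", "pos": "NNP", "ner": "PERSON"},
    {"word": "works", "pos": "VBZ", "ner": "O"},
    {"word": "short", "pos": "JJ", "ner": "O"},
    {"word": "term", "pos": "NN", "ner": "O"},
]}

# The same tokens as tagged in the lower-cased sentence.
lowered = {"tokens": [
    {"word": "john", "pos": "NN", "ner": "O"},
    {"word": "donk", "pos": "JJ", "ner": "O"},
    {"word": "works", "pos": "NNS", "ner": "O"},
    {"word": "short", "pos": "JJ", "ner": "O"},
    {"word": "term", "pos": "NN", "ner": "O"},
]}

print(proper_nouns(cased))    # [('John', 'PERSON'), ('Donk', 'PERSON')]
print(proper_nouns(lowered))  # []
```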

So the questions you should be asking are:

What is the ultimate aim of your NLP application?
Why is your input lower-cased? Was it your doing, or is that how the data was provided?

After answering those questions, you can move on to decide what you really want to do with the NER tags, i.e.

If the input is lower-cased because of how you structured your NLP tool chain, then:

Do NOT do that! Perform the NER on the normal text, without the distortions you've created. The NER was trained on normal text, so it won't really work outside the context of normal text.
Also, try not to mix NLP tools from different suites; they usually do not play nicely together, especially at the end of your NLP tool chain.
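One way to read the first point: tag the untouched text first, and apply stop-word or length filters to the tagged tokens afterwards instead of filtering before tagging. A minimal sketch follows, assuming `tag_tokens` is any callable that maps a token list to `(word, tag)` pairs (for example a tagger's `tag` method). `fake_tag`, the one-word stop list and the helper name `tag_then_filter` are hypothetical and exist only so the snippet runs on its own.

```python
def tag_then_filter(tokens, tag_tokens, stop_words, min_len=3):
    """Tag the full token sequence, then drop stop words and short tokens."""
    tagged = tag_tokens(tokens)  # the tagger sees the undistorted sentence
    return [
        (word, tag)
        for word, tag in tagged
        if word.lower() not in stop_words and len(word) >= min_len
    ]


def fake_tag(tokens):
    """Toy tagger for illustration: capitalized tokens get NNP, the rest O."""
    return [(tok, "NNP" if tok[:1].isupper() else "O") for tok in tokens]


tokens = "John Donk works for POI .".split()
print(tag_then_filter(tokens, fake_tag, stop_words={"for"}))
```

With a real tagger in place of `fake_tag`, the tagger still receives the stop words and punctuation as context, and only the results are pruned.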

If the input is lower-cased because that is how the original data is, then: