Splitting nltk.FreqDist words into two lists?


Problem description

I have a series of texts that are instances of a custom WebText class. Each text is an object with a rating (-10 to +10) and a word count (an nltk.FreqDist) associated with it:

>>> trainingTexts = [WebText('train1.txt'), WebText('train2.txt'), WebText('train3.txt'), WebText('train4.txt')]
>>> trainingTexts[1].rating
10
>>> trainingTexts[1].freq_dist
<FreqDist: 'the': 60, ',': 49, 'to': 38, 'is': 34,...>

How can I now get two lists (or dictionaries): one containing every word used exclusively in the positively rated texts (trainingTexts[i].rating > 0), and another containing every word used exclusively in the negatively rated texts (trainingTexts[i].rating < 0)? Each list should hold the total word counts across all the positive or all the negative texts, so that the result looks something like this:

>>> only_positive_words
[('sky', 10), ('good', 9), ('great', 2)...]
>>> only_negative_words
[('earth', 10), ('ski', 9), ('food', 2)...]

I considered using sets, since sets contain unique elements, but I can't see how to do this with nltk.FreqDist. On top of that, a set wouldn't be ordered by word frequency. Any ideas?

Recommended answer

OK, let's say you start with this for testing purposes:

import nltk

class Rated(object):
    def __init__(self, rating, freq_dist):
        self.rating = rating
        self.freq_dist = freq_dist

a = Rated(5, nltk.FreqDist('the boy sees the dog'.split()))
b = Rated(8, nltk.FreqDist('the cat sees the mouse'.split()))
c = Rated(-3, nltk.FreqDist('some boy likes nothing'.split()))

trainingTexts = [a, b, c]

Then your code looks like this:

from collections import defaultdict
from operator import itemgetter

# dictionaries for keeping track of the counts
pos_dict = defaultdict(int)
neg_dict = defaultdict(int)

for r in trainingTexts:
    rating = r.rating
    freq = r.freq_dist

    # choose the appropriate counts dict
    if rating > 0:
        partition = pos_dict
    elif rating < 0:
        partition = neg_dict
    else:
        continue

    # add the information to the correct counts dict
    for word, count in freq.items():
        partition[word] += count

# Turn the counts dictionaries into lists of descending-frequency words
def only_list(counts, filtered):
    # keep only the words that never occur in the other partition
    exclusive = [(w, c) for w, c in counts.items() if w not in filtered]
    return sorted(exclusive, key=itemgetter(1), reverse=True)

only_positive_words = only_list(pos_dict, neg_dict)
only_negative_words = only_list(neg_dict, pos_dict)

The result is:

>>> only_positive_words
[('the', 4), ('sees', 2), ('dog', 1), ('cat', 1), ('mouse', 1)]
>>> only_negative_words
[('nothing', 1), ('some', 1), ('likes', 1)]

Words with equal counts may appear in a different order depending on the Python version.
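As a possible alternative, the same partitioning can be sketched with Counter arithmetic, which also answers the question about sets: membership tests against the other partition do the "exclusive" filtering, and most_common() does the frequency ordering. This is only a sketch under stated assumptions. It uses collections.Counter as a stand-in for nltk.FreqDist (in NLTK 3, FreqDist is a Counter subclass), the (rating, counts) pairs stand in for the WebText objects, and the names texts and exclusive are hypothetical, not part of the original answer.

```python
from collections import Counter

# Hypothetical stand-in for the WebText objects: (rating, word counts) pairs.
# Counter is used in place of nltk.FreqDist so the sketch runs without NLTK.
texts = [
    (5, Counter('the boy sees the dog'.split())),
    (8, Counter('the cat sees the mouse'.split())),
    (-3, Counter('some boy likes nothing'.split())),
]

# Sum the per-text counts for each polarity; texts rated 0 are ignored.
pos_counts = sum((c for r, c in texts if r > 0), Counter())
neg_counts = sum((c for r, c in texts if r < 0), Counter())

def exclusive(counts, other):
    """Return (word, count) pairs for words absent from `other`, most frequent first."""
    only = Counter({w: n for w, n in counts.items() if w not in other})
    return only.most_common()

only_positive_words = exclusive(pos_counts, neg_counts)
only_negative_words = exclusive(neg_counts, pos_counts)

print(only_positive_words)
print(only_negative_words)
```

With the real objects, the pairs would presumably be built from each text's rating and freq_dist attributes. The word 'boy' is dropped from both results because it occurs in a positive and a negative text.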