I have a large (about 12M rows) dataframe df with say:

```python
df.columns = ['word', 'documents', 'frequency']
```

So the following ran in a timely fashion:
```python
word_grouping = df[['word', 'frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word', 'MaxFrequency']
```

However, this is taking an unexpectedly long time to run:
```python
Occurrences_of_Words = word_grouping[['word']].count().reset_index()
```

What am I doing wrong here? Is there a better way to count occurrences in a large dataframe?
```python
df.word.describe()
```

ran pretty well, so I really did not expect this Occurrences_of_Words dataframe to take very long to build.
PS: If the answer is obvious and you feel the need to penalize me for asking this question, please include the answer as well. Thank you.
Answer:

I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max; both take some time to avoid missing values. (Compare with size.)
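A minimal sketch of that suggestion, using a small toy frame in place of the 12M-row df (the column names come from the question; the data is made up):

```python
import pandas as pd

# Toy stand-in for the large dataframe from the question.
df = pd.DataFrame({'word': ['the', 'cat', 'the', 'dog', 'the', 'cat'],
                   'documents': [1, 2, 3, 4, 5, 6],
                   'frequency': [10, 3, 7, 2, 5, 1]})

# One call replaces the groupby + count round-trip; value_counts
# returns a Series of counts sorted in descending order.
Occurrences_of_Words = df['word'].value_counts().reset_index()
Occurrences_of_Words.columns = ['word', 'Occurrences']
print(Occurrences_of_Words)
```

Renaming via `.columns = [...]` sidesteps the fact that the column labels produced by `reset_index()` changed between pandas versions.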
In any case, value_counts has been specifically optimized to handle object dtype, like your words, so I doubt you'll do much better than that.
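To see the count-versus-size distinction around missing values mentioned above, a small illustration (the frame here is invented for the demo):

```python
import pandas as pd

# A frame with one missing frequency so count and size diverge.
df = pd.DataFrame({'word': ['the', 'the', 'cat'],
                   'frequency': [1.0, None, 2.0]})

g = df.groupby('word')['frequency']
print(g.count())  # excludes NaN within each group: the -> 1, cat -> 1
print(g.size())   # counts rows regardless of NaN: the -> 2, cat -> 1
```

The NaN check is what makes count do extra work per group; size (and value_counts on the grouping column itself) avoids it.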