Do aggregations and statistics on a very large Pandas DataFrame

Updated: 2024-10-15 14:12:06


I have a very large CSV data set (larger than will fit in affordable RAM) that I'm processing with pandas. Currently I'm doing something like:

df = pd.read_csv('verylargefile.csv', chunksize=10000)
for df_chunk in df:
    df_chunk['new_column'] = df_chunk['old_column'].apply(my_func)
    # do other operations and filters...
    df_chunk.to_csv('processed.csv', mode='a')

so I can perform the operations I need on the data set and save the output to another file.

The problem starts when I try to do groupings and statistics on this data set: I need to calculate the mean, standard deviation and histograms of the entire data set, and also graph the results and trends using statsmodels etc. Since the chunks are not homogeneous samples, I cannot simply compute the statistics chunk by chunk.

df.groupby('column_1').sum()
TypeError: Cannot groupby on a TextFileReader

I don't have the usual options of selecting only a few columns and I don't see how storing my data on HDF would help. Is there a way?

Accepted answer


This looks like a situation where dask.dataframe could help. I have included an example below. (Note: the session below calls dd.read_csv with chunkbytes; in current dask releases this parameter has been renamed to blocksize.)

For more information see the dask documentation and this very nice tutorial.

In [1]: import dask.dataframe as dd

In [2]: !head data/accounts.0.csv
id,names,amount
138,Ray,1502
1,Tim,5
388,Ingrid,45
202,Zelda,1324
336,Jerry,-1607
456,Laura,-2832
65,Laura,-326
54,Yvonne,341
92,Sarah,3136

In [3]: dask_df = dd.read_csv('data/accounts.0.csv', chunkbytes=4000000)

In [4]: dask_df.npartitions
Out[4]: 4

In [5]: len(dask_df)
Out[5]: 1000000

In [6]: result = dask_df.groupby('names').sum()

In [7]: result.compute()
Out[7]:
                id    amount
names
Alice     10282524  43233084
Bob        8617531  47512276
Charlie    8056803  47729638
Dan       10146581  32513817
Edith     15164281  37806024
Frank     11310157  63869156
George    14941235  80436603
Hannah     3006336  25721574
Ingrid    10123877  54152865
Jerry     10317245   8613040
Kevin      6809100  16755317
Laura      9941112  34723055
Michael   11200937  36431387
Norbert    5715799  14482698
Oliver    10423117  32415534
Patricia  15289085  22767501
Quinn     10686459  16083432
Ray       10156429   9455663
Sarah      7977036  34970428
Tim       12283567  47851141
Ursula     4893696  37942347
Victor     8864468  15542688
Wendy      9348077  68824579
Xavier     6600945  -3482124
Yvonne     5665415  12701550
Zelda      8491817  42573021
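If installing dask is not an option, the whole-data-set statistics the question asks for (mean, standard deviation, histogram) can also be accumulated chunk by chunk with plain pandas and numpy, by keeping running totals instead of per-chunk statistics. A minimal sketch on made-up in-memory data standing in for 'verylargefile.csv' (the column names mirror the question, the values are invented):

```python
import io
import numpy as np
import pandas as pd

# Small in-memory CSV standing in for the very large file (invented data).
csv_data = """id,names,amount
138,Ray,1502
1,Tim,5
388,Ingrid,45
202,Zelda,1324
336,Jerry,-1607
456,Laura,-2832
65,Laura,-326
54,Yvonne,341
"""

# Histogram bins must be fixed up front so per-chunk counts can be summed.
bins = np.linspace(-3000, 2000, 11)

count = 0
total = 0.0
total_sq = 0.0
hist = np.zeros(len(bins) - 1, dtype=int)

# Accumulate raw moments and bin counts; these combine exactly across chunks,
# unlike per-chunk means or standard deviations.
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=3):
    x = chunk["amount"].to_numpy(dtype=float)
    count += x.size
    total += x.sum()
    total_sq += (x ** 2).sum()
    hist += np.histogram(x, bins=bins)[0]

mean = total / count
# Population standard deviation recovered from the accumulated moments.
std = np.sqrt(total_sq / count - mean ** 2)
print(mean, std, hist)
```

The same pattern (accumulate sufficient statistics, finalize once at the end) works for any moment-based quantity; only the bins need to be chosen before the first chunk is read.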

For comparison, here is the result using pandas. The data I am using here fits in memory, but dask will work even when the data is larger than memory.

In [8]: import pandas as pd

In [9]: pandas_df = pd.read_csv('data/accounts.0.csv')

In [10]: len(pandas_df)
Out[10]: 1000000

In [11]: pandas_df.groupby('names').sum()
Out[11]:
                id    amount
names
Alice     10282524  43233084
Bob        8617531  47512276
Charlie    8056803  47729638
Dan       10146581  32513817
Edith     15164281  37806024
Frank     11310157  63869156
George    14941235  80436603
Hannah     3006336  25721574
Ingrid    10123877  54152865
Jerry     10317245   8613040
Kevin      6809100  16755317
Laura      9941112  34723055
Michael   11200937  36431387
Norbert    5715799  14482698
Oliver    10423117  32415534
Patricia  15289085  22767501
Quinn     10686459  16083432
Ray       10156429   9455663
Sarah      7977036  34970428
Tim       12283567  47851141
Ursula     4893696  37942347
Victor     8864468  15542688
Wendy      9348077  68824579
Xavier     6600945  -3482124
Yvonne     5665415  12701550
Zelda      8491817  42573021
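The TypeError in the question arises because read_csv with chunksize returns a TextFileReader, not a DataFrame. The groupby-sum itself can still be done out of core with plain pandas by summing each chunk's groups and then reducing the partial results with a second groupby; this works because a sum of group sums equals the overall group sum. A minimal sketch on invented toy data (names and values are made up):

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a file too large to load at once.
csv_data = """names,amount
Alice,10
Bob,20
Alice,5
Bob,-3
Carol,7
Alice,1
"""

# Sum each chunk's groups, then reduce the partial sums with a second groupby.
partials = [
    chunk.groupby("names")["amount"].sum()
    for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=2)
]
result = pd.concat(partials).groupby(level=0).sum()
print(result)  # group totals: Alice=16, Bob=17, Carol=7
```

This combine step is exact for sums and counts; for a grouped mean you would carry per-group sums and counts through the chunks and divide only at the end, which is essentially what dask automates.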

Published: 2023-08-02 09:02:00