I'm not sure what I'm missing here, I thought dask would resolve my memory issues. I have 100+ pandas dataframes saved in .pickle format. I would like to get them all in the same dataframe but keep running into memory issues. I've already increased the memory buffer in jupyter. It seems I may be missing something in creating the dask dataframe as it appears to crash my notebook after completely filling my RAM (maybe). Any pointers?
Below is the basic process I used:
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.read_pickle('first.pickle'), npartitions=8)
for pickle_file in all_pickle_files:
    ddf = ddf.append(pd.read_pickle(pickle_file))
ddf.to_parquet('alldata.parquet', engine='pyarrow')
- I've tried a variety of npartitions, but no number has allowed the code to finish running.
- All in all, there are about 30GB of pickled dataframes I'd like to combine.
- Perhaps this is not the right library, but the docs suggest dask should be able to handle this.

Answer:
Have you considered first converting the pickle files to parquet and then loading them into dask? I assume all your data is in a folder called data and you want to move it to processed.
import pandas as pd
import dask.dataframe as dd
import os

def convert_to_parquet(fn, fldr_in, fldr_out):
    fn_out = fn.replace(fldr_in, fldr_out)\
               .replace(".pickle", ".parquet")
    df = pd.read_pickle(fn)
    # eventually change dtypes here
    df.to_parquet(fn_out, index=False)

fldr_in = 'data'
fldr_out = 'processed'
os.makedirs(fldr_out, exist_ok=True)

# you could use glob if you prefer
fns = os.listdir(fldr_in)
fns = [os.path.join(fldr_in, fn) for fn in fns]

If you know that no more than one file fits in memory at a time, you should use a loop:
for fn in fns:
    convert_to_parquet(fn, fldr_in, fldr_out)

If you know that several files fit in memory at once, you can use delayed:
from dask import delayed, compute

# this is lazy
out = [delayed(convert_to_parquet)(fn, fldr_in, fldr_out) for fn in fns]
# now you are actually converting
out = compute(out)

Now you can use dask to do your analysis.