I'm not sure what I'm missing here, I thought dask would resolve my memory issues. I have 100+ pandas dataframes saved in .pickle format. I would like to get them all in the same dataframe but keep running into memory issues. I've already increased the memory buffer in jupyter. It seems I may be missing something in creating the dask dataframe as it appears to crash my notebook after completely filling my RAM (maybe). Any pointers?
Below is the basic process I used:
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.read_pickle('first.pickle'), npartitions=8)
for pickle_file in all_pickle_files:
    ddf = ddf.append(pd.read_pickle(pickle_file))
ddf.to_parquet('alldata.parquet', engine='pyarrow')
- I've tried a variety of npartitions, but no number has allowed the code to finish running.
- All in all, there are about 30GB of pickled dataframes I'd like to combine.
- Perhaps this is not the right library, but the docs suggest dask should be able to handle this.

Answer:
Have you considered first converting the pickle files to parquet and then loading them into dask? I assume all your data is in a folder called data and you want to move it to processed.
import pandas as pd
import dask.dataframe as dd
import os

def convert_to_parquet(fn, fldr_in, fldr_out):
    fn_out = fn.replace(fldr_in, fldr_out)\
               .replace(".pickle", ".parquet")
    df = pd.read_pickle(fn)
    # eventually change dtypes here
    df.to_parquet(fn_out, index=False)

fldr_in = 'data'
fldr_out = 'processed'
os.makedirs(fldr_out, exist_ok=True)

# you could use glob if you prefer
fns = os.listdir(fldr_in)
fns = [os.path.join(fldr_in, fn) for fn in fns]

If you know that no more than one file fits in memory at a time, you should use a loop:
for fn in fns:
    convert_to_parquet(fn, fldr_in, fldr_out)

If you know that several files fit in memory at once, you can use delayed:
from dask import delayed, compute

# this is lazy
out = [delayed(convert_to_parquet)(fn, fldr_in, fldr_out) for fn in fns]
# now you are actually converting
out = compute(out)

Now you can use dask to do your analysis.