Pandas dataframes too large to append to a dask dataframe?

Problem description


I'm not sure what I'm missing here, I thought dask would resolve my memory issues. I have 100+ pandas dataframes saved in .pickle format. I would like to get them all in the same dataframe but keep running into memory issues. I've already increased the memory buffer in jupyter. It seems I may be missing something in creating the dask dataframe as it appears to crash my notebook after completely filling my RAM (maybe). Any pointers?


Below is the basic process I used:

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.read_pickle('first.pickle'), npartitions=8)
for pickle_file in all_pickle_files:
    ddf = ddf.append(pd.read_pickle(pickle_file))
ddf.to_parquet('alldata.parquet', engine='pyarrow')

  • I've tried a variety of npartitions, but no number has allowed the code to finish running.
  • All in all there is about 30GB of pickled dataframes I'd like to combine.
  • Perhaps this is not the right library, but the docs suggest dask should be able to handle this.

Answer


Have you considered first converting the pickle files to parquet and then loading them into dask? I assume that all your data is in a folder called raw and you want to move it to processed.

import pandas as pd
import dask.dataframe as dd
import os

def convert_to_parquet(fn, fldr_in, fldr_out):
    fn_out = fn.replace(fldr_in, fldr_out)\
               .replace(".pickle", ".parquet")
    df = pd.read_pickle(fn)
    # eventually change dtypes
    df.to_parquet(fn_out, index=False)

fldr_in = 'data'
fldr_out = 'processed'

os.makedirs(fldr_out, exist_ok=True)

# you could use glob if you prefer
fns = os.listdir(fldr_in)
fns = [os.path.join(fldr_in, fn) for fn in fns]


If you know that no more than one file fits in memory at a time, you should use a loop:

for fn in fns:
    convert_to_parquet(fn, fldr_in, fldr_out)


If you know that more than one file fits in memory at a time, you can use delayed:

from dask import delayed, compute

# this is lazy: it only builds the task graph
out = [delayed(convert_to_parquet)(fn, fldr_in, fldr_out) for fn in fns]
# now you are actually converting
out = compute(out)


Now you can use dask to do your analysis.
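
For example, a minimal sketch of loading the converted files lazily (assuming the parquet files landed in the processed folder from the conversion step; the glob pattern and output name are illustrative):

import dask.dataframe as dd

# read every converted parquet file into one lazy dask dataframe
ddf = dd.read_parquet('processed/*.parquet', engine='pyarrow')

# write the combined data out as a single partitioned parquet dataset
ddf.to_parquet('alldata.parquet', engine='pyarrow')

Because read_parquet is lazy, only the partitions needed for a given computation are pulled into memory, which avoids loading all 30GB at once.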
