Using plyr, doMC, and summarize() with very large datasets?

Problem description


I have a fairly large dataset (~1.4m rows) that I'm doing some splitting and summarizing on. The whole thing takes a while to run, and my final application depends on running it frequently, so my thought was to use doMC and the .parallel = TRUE flag with plyr, like so (simplified a bit):

library(plyr)
require(doMC)
registerDoMC()  # register the parallel backend; no cores argument given here

df <- ddply(df, c("cat1", "cat2"), summarize, count = length(cat2),
            .parallel = TRUE)


If I set the number of cores explicitly to two (using registerDoMC(cores=2)), my 8 GB of RAM sees me through, and it shaves off a decent amount of time. However, if I let it use all 8 cores, I quickly run out of memory, because each of the forked processes appears to clone the entire dataset in memory.
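For reference, capping the worker count is a one-line change to the setup shown above (cores = 2 is exactly the setting described in this paragraph):

library(doMC)
registerDoMC(cores = 2)  # cap the number of forked workers to bound peak memory use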


My question is whether it is possible to use plyr's parallel execution facilities in a more memory-thrifty way. I tried converting my data frame to a big.matrix, but this simply seemed to force the whole thing back to using a single core:

library(plyr)
library(doMC)
registerDoMC()
library(bigmemory)

bm <- as.big.matrix(df)
df <- mdply(bm, c("cat1", "cat2"), summarize, count = length(cat2),
            .parallel = TRUE)


This is my first foray into multicore R computing, so if there is a better way of thinking about this, I'm open to suggestions.


UPDATE: As with many things in life, it turns out I was doing Other Stupid Things elsewhere in my code, and the whole issue of multi-processing turned out to be a moot point in this particular instance. However, for big data-folding tasks, I'll keep data.table in mind. I was able to replicate my folding task with it in a straightforward way.

Answer


I do not think that plyr makes copies of the entire dataset. However, when processing a chunk of data, that subset is copied to the worker. Therefore, when using more workers, more subsets are in memory simultaneously (i.e. 8 instead of 2).


I can think of a few tips you could try:

  • Put your data into an array structure instead of a data.frame and use adply to do the summarizing. Arrays are much more efficient in terms of both memory use and speed. I mean using normal matrices, not big.matrix. (See the first sketch after this list.)
  • Give data.table a try; in some cases this can lead to a speed increase of several orders of magnitude. I'm not sure whether data.table supports parallel processing, but even without parallelization, data.table can be hundreds of times faster. See a blog post of mine comparing ave, ddply and data.table for processing chunks of data. (See the second sketch after this list.)
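As a rough illustration of the first tip: for a pure group-count like the one in the question, base R's table() is the simplest array-style aggregation. This is my sketch, not code from the answer, and note that unlike the ddply() call it also emits zero-count combinations:

# Cross-tabulate the two grouping columns into a cat1 x cat2 count matrix,
# then flatten back to a data frame (columns come out as Var1, Var2, Freq).
counts <- as.data.frame(table(df$cat1, df$cat2))
names(counts) <- c("cat1", "cat2", "count")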
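And a minimal sketch of the data.table tip, reproducing the same group-count (the exact call is my illustration using standard data.table syntax, not the answerer's code):

library(data.table)

dt <- as.data.table(df)
# .N is data.table's per-group row count; this replaces
# summarize(count = length(cat2)) from the ddply() call above.
counts <- dt[, list(count = .N), by = list(cat1, cat2)]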
