Using plyr, doMC, and summarize() with very large datasets?

Problem description


I have a fairly large dataset (~1.4m rows) that I'm doing some splitting and summarizing on. The whole thing takes a while to run, and my final application depends on running it frequently, so my thought was to use doMC and the .parallel = TRUE flag with plyr, like so (simplified a bit):

library(plyr)
require(doMC)
registerDoMC()  # register the parallel backend; no cores argument given here

df <- ddply(df, c("cat1", "cat2"), summarize, count = length(cat2),
            .parallel = TRUE)


If I set the number of cores explicitly to two (using registerDoMC(cores=2)), my 8 GB of RAM sees me through, and it shaves off a decent amount of time. However, if I let it use all 8 cores, I quickly run out of memory, because each of the forked processes appears to clone the entire dataset in memory.
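For reference, capping the worker count is a one-line change to the setup shown above (cores = 2 is exactly the setting described in this paragraph):

library(doMC)
registerDoMC(cores = 2)  # cap the number of forked workers to bound peak memory use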


My question is whether it is possible to use plyr's parallel execution facilities in a more memory-thrifty way. I tried converting my data frame to a big.matrix, but this simply seemed to force the whole thing back to using a single core:

library(plyr)
library(doMC)
registerDoMC()
library(bigmemory)

bm <- as.big.matrix(df)
df <- mdply(bm, c("cat1", "cat2"), summarize, count = length(cat2),
            .parallel = TRUE)


This is my first foray into multicore R computing, so if there is a better way of thinking about this, I'm open to suggestions.


UPDATE: As with many things in life, it turns out I was doing Other Stupid Things elsewhere in my code, and the whole issue of multi-processing turned out to be a moot point in this particular instance. However, for big data-folding tasks, I'll keep data.table in mind. I was able to replicate my folding task with it in a straightforward way.

Answer


I do not think that plyr makes copies of the entire dataset. However, when processing a chunk of data, that subset is copied to the worker. Therefore, when using more workers, more subsets are in memory simultaneously (i.e. 8 instead of 2).


I can think of a few tips you could try:

  • Put your data into an array structure instead of a data.frame and use adply to do the summarizing. Arrays are much more efficient in terms of both memory use and speed. I mean using normal matrices, not big.matrix. (See the first sketch after this list.)
  • Give data.table a try; in some cases this can lead to a speed increase of several orders of magnitude. I'm not sure whether data.table supports parallel processing, but even without parallelization, data.table can be hundreds of times faster. See a blog post of mine comparing ave, ddply and data.table for processing chunks of data. (See the second sketch after this list.)
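As a rough illustration of the first tip: for a pure group-count like the one in the question, base R's table() is the simplest array-style aggregation. This is my sketch, not code from the answer, and note that unlike the ddply() call it also emits zero-count combinations:

# Cross-tabulate the two grouping columns into a cat1 x cat2 count matrix,
# then flatten back to a data frame (columns come out as Var1, Var2, Freq).
counts <- as.data.frame(table(df$cat1, df$cat2))
names(counts) <- c("cat1", "cat2", "count")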
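And a minimal sketch of the data.table tip, reproducing the same group-count (the exact call is my illustration using standard data.table syntax, not the answerer's code):

library(data.table)

dt <- as.data.table(df)
# .N is data.table's per-group row count; this replaces
# summarize(count = length(cat2)) from the ddply() call above.
counts <- dt[, list(count = .N), by = list(cat1, cat2)]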
