Why is split inefficient on large data frames with many groups?

df %>% split(.$x)

becomes slow for a large number of unique values of x. If we instead split the data frame manually into smaller subsets and then perform split on each subset, we reduce the time by at least an order of magnitude.

library(dplyr)
library(microbenchmark)
library(caret)
library(purrr)

N <- 10^6
groups <- 10^5
df <- data.frame(x = sample(1:groups, N, replace = TRUE),
                 y = sample(letters, N, replace = TRUE))
ids <- df$x %>% unique
folds10 <- createFolds(ids, 10)
folds100 <- createFolds(ids, 100)

Running microbenchmark gives us

## Unit: seconds
## expr                                                                                      mean
l1 <- df %>% split(.$x)                                                                  # 242.11805
l2 <- lapply(folds10, function(id) df %>% filter(x %in% id) %>% split(.$x)) %>% flatten  #  50.45156
l3 <- lapply(folds100, function(id) df %>% filter(x %in% id) %>% split(.$x)) %>% flatten #  12.83866
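For reference, a minimal sketch of how timings like these could be collected with microbenchmark; the times argument and the object name mb are assumptions, not the original call:

# Assumed benchmarking call (not the original one): compares the full split
# against the two fold-based strategies, running each expression a few times.
mb <- microbenchmark(
  l1 = df %>% split(.$x),
  l2 = lapply(folds10,  function(id) df %>% filter(x %in% id) %>% split(.$x)) %>% flatten,
  l3 = lapply(folds100, function(id) df %>% filter(x %in% id) %>% split(.$x)) %>% flatten,
  times = 5
)
summary(mb, unit = "s")  # the mean column gives the per-expression mean in seconds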

Is split not designed for large groups? Are there any alternatives besides the manual initial subsetting?

My laptop is a late-2013 MacBook Pro, 2.4 GHz, 8 GB RAM.

Accepted answer

This isn't strictly a split.data.frame issue; there is a more general problem with the scalability of data.frame for many groups. You can get a pretty nice speed-up if you use split.data.table. I developed this method on top of regular data.table methods and it seems to scale pretty well here.

system.time(
  l1 <- df %>% split(.$x)
)
#    user  system elapsed
# 200.936   0.000 217.496

library(data.table)
dt = as.data.table(df)

system.time(
  l2 <- split(dt, by = "x")
)
#    user  system elapsed
#   7.372   0.000   6.875

system.time(
  l3 <- split(dt, by = "x", sorted = TRUE)
)
#    user  system elapsed
#   9.068   0.000   8.200

sorted=TRUE will return the list in the same order as the data.frame method; by default, the data.table method preserves the order present in the input data. If you want to stick to data.frame, you can use lapply(l2, setDF) at the end.
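As a minimal illustration of that last step (assuming the l2 list produced by split(dt, by = "x") above), setDF converts each element by reference, so no reassignment is needed:

l2 <- split(dt, by = "x")
invisible(lapply(l2, setDF))  # setDF turns each data.table into a data.frame in place
class(l2[[1]])                # "data.frame"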

PS. split.data.table was added in 1.9.7, and installation of the devel version is pretty simple:

install.packages("data.table", type="source", repos="http://Rdatatable.github.io/data.table")

More about that in the Installation wiki.
