Why is split inefficient on large data frames with many groups?

df %>% split(.$x)

becomes slow for a large number of unique values of x. If we instead split the data frame manually into smaller subsets and then perform split on each subset, we reduce the time by at least an order of magnitude.

library(dplyr)
library(microbenchmark)
library(caret)
library(purrr)

N <- 10^6
groups <- 10^5
df <- data.frame(x = sample(1:groups, N, replace = TRUE),
                 y = sample(letters, N, replace = TRUE))
ids <- df$x %>% unique
folds10 <- createFolds(ids, 10)
folds100 <- createFolds(ids, 100)

Running microbenchmark gives us

## Unit: seconds
## expr                                                                                      mean
l1 <- df %>% split(.$x)                                                                  # 242.11805
l2 <- lapply(folds10, function(id) df %>% filter(x %in% id) %>% split(.$x)) %>% flatten  #  50.45156
l3 <- lapply(folds100, function(id) df %>% filter(x %in% id) %>% split(.$x)) %>% flatten #  12.83866
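For reference, a minimal sketch of how timings like these could be collected with microbenchmark; the times argument and the object name mb are assumptions, not the original call:

# Assumed benchmarking call (not the original one): compares the full split
# against the two fold-based strategies, running each expression a few times.
mb <- microbenchmark(
  l1 = df %>% split(.$x),
  l2 = lapply(folds10,  function(id) df %>% filter(x %in% id) %>% split(.$x)) %>% flatten,
  l3 = lapply(folds100, function(id) df %>% filter(x %in% id) %>% split(.$x)) %>% flatten,
  times = 5
)
summary(mb, unit = "s")  # the mean column gives the per-expression mean in seconds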

Is split not designed for large groups? Are there any alternatives besides the manual initial subsetting?

My laptop is a late-2013 MacBook Pro, 2.4 GHz, 8 GB RAM.

Accepted answer

This isn't strictly a split.data.frame issue; there is a more general problem with the scalability of data.frame for many groups. You can get a pretty nice speed-up if you use split.data.table. I developed this method on top of regular data.table methods and it seems to scale pretty well here.

system.time(
  l1 <- df %>% split(.$x)
)
#    user  system elapsed
# 200.936   0.000 217.496

library(data.table)
dt = as.data.table(df)

system.time(
  l2 <- split(dt, by = "x")
)
#    user  system elapsed
#   7.372   0.000   6.875

system.time(
  l3 <- split(dt, by = "x", sorted = TRUE)
)
#    user  system elapsed
#   9.068   0.000   8.200

sorted=TRUE will return the list in the same order as the data.frame method; by default, the data.table method preserves the order present in the input data. If you want to stick to data.frame, you can use lapply(l2, setDF) at the end.
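As a minimal illustration of that last step (assuming the l2 list produced by split(dt, by = "x") above), setDF converts each element by reference, so no reassignment is needed:

l2 <- split(dt, by = "x")
invisible(lapply(l2, setDF))  # setDF turns each data.table into a data.frame in place
class(l2[[1]])                # "data.frame"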

PS. split.data.table was added in 1.9.7, and installation of the devel version is pretty simple:

install.packages("data.table", type="source", repos="http://Rdatatable.github.io/data.table")

More about that in the Installation wiki.
