Does the by() function make a list that grows one element at a time?

Updated: 2024-10-28 16:28:15

Problem description

Does the by function make a list that grows one element at a time?

I need to process a data frame with about 4M observations grouped by a factor column. The situation is similar to the example below:

> # Make 4M rows of data
> x = data.frame(col1=1:4000000, col2=10000001:14000000)
> # Make a factor
> x[,"f"] = x[,"col1"] - x[,"col1"] %% 5
>
> head(x)
  col1     col2 f
1    1 10000001 0
2    2 10000002 0
3    3 10000003 0
4    4 10000004 0
5    5 10000005 5
6    6 10000006 5

Now, a tapply on one of the columns takes a reasonable amount of time:

> t1 = Sys.time()
> z = tapply(x[, 1], x[, "f"], mean)
> Sys.time() - t1
Time difference of 22.14491 secs

But if I do this:

z = by(x[, 1], x[, "f"], mean)

That doesn't finish in anywhere near the same time (I gave up after a minute).

Of course, in the above example, tapply could be used, but I actually need to process multiple columns together. What is a better way to do this?

Recommended answer

by is slower than tapply because it is a wrapper around tapply. Let's take a look at some benchmarks: in this situation, tapply is more than 3x faster than by.

UPDATED to include @Roland's great recommendation:

library(rbenchmark)
library(data.table)

dt <- data.table(x, key="f")

using.tapply <- quote(tapply(x[, 1], x[, "f"], mean))
using.by     <- quote(by(x[, 1], x[, "f"], mean))
using.dtable <- quote(dt[,mean(col1),by=key(dt)])

times <- benchmark(using.tapply, using.dtable, using.by, replications=10, order="relative")
times[,c("test", "elapsed", "relative")]

#------------------------#
#         RESULTS        #
#------------------------#

#       COMPARING tapply VS by     #
#-----------------------------------
#              test elapsed relative
#   1  using.tapply   2.453    1.000
#   2      using.by   8.889    3.624

#     COMPARING data.table VS tapply VS by #
#------------------------------------------#
#              test elapsed relative
#   2  using.dtable   0.168    1.000
#   1  using.tapply   2.396   14.262
#   3      using.by   8.566   50.988

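If x$f is a factor, the loss in efficiency between tapply and by is even greater!

Note, though, that both of them improve relative to the non-factor input, while data.table stays approximately the same or gets slightly worse: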

x[, "f"] <- as.factor(x[, "f"])
dt <- data.table(x, key="f")

times <- benchmark(using.tapply, using.dtable, using.by, replications=10, order="relative")
times[,c("test", "elapsed", "relative")]

#              test elapsed relative
#   2  using.dtable   0.175    1.000
#   1  using.tapply   1.803   10.303
#   3      using.by   7.854   44.880
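Since the question is really about processing several columns together, here is a minimal sketch of how the data.table call from the benchmark could be extended to do that. It uses a 20-row stand-in for the 4M-row data frame, and the output column names (mean1, mean2, n, dot) and the choice of statistics are assumptions made purely for illustration.

```r
library(data.table)

# Small stand-in for the question's data (the real one has 4M rows)
x <- data.frame(col1 = 1:20, col2 = 101:120)
x$f <- x$col1 - x$col1 %% 5
dt <- data.table(x, key = "f")

# Several columns summarised in a single grouped pass
res <- dt[, .(mean1 = mean(col1), mean2 = mean(col2), n = .N), by = f]

# A statistic that needs two columns at once (illustrative choice)
res2 <- dt[, .(dot = sum(col1 * col2)), by = f]

print(res)
print(res2)
```

Each expression inside .( ) is evaluated once per group with the group's columns available as plain vectors, so no per-group data frame has to be built.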

?by:

Description

Function by is an object-oriented wrapper for tapply applied to data frames.

Let's take a look at the source for by (or, more specifically, by.data.frame):

by.data.frame
function (data, INDICES, FUN, ..., simplify = TRUE)
{
    if (!is.list(INDICES)) {
        IND <- vector("list", 1L)
        IND[[1L]] <- INDICES
        names(IND) <- deparse(substitute(INDICES))[1L]
    }
    else IND <- INDICES
    FUNx <- function(x) FUN(data[x, , drop = FALSE], ...)
    nd <- nrow(data)
    ans <- eval(substitute(tapply(seq_len(nd), IND, FUNx, simplify = simplify)),
        data)
    attr(ans, "call") <- match.call()
    class(ans) <- "by"
    ans
}

We see immediately that there is still a call to tapply, plus a lot of extras (including a call to deparse(substitute(.)) and an eval(substitute(.)), both of which are relatively slow). It therefore makes sense that your tapply will be relatively faster than a similar call to by.
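To make the wrapping concrete, here is a minimal sketch on a 20-row stand-in data frame (the helper name fun is made up for illustration). It reproduces by hand the core of by.data.frame shown above, that is, a tapply over row indices with a data frame subset per group, and checks that the result matches by(). The per-group data[x, , drop = FALSE] subset visible in the source is probably also a noticeable part of the cost when there are many small groups, although the benchmarks above do not isolate it.

```r
# Small stand-in for the 4M-row data frame from the question
x <- data.frame(col1 = 1:20, col2 = 101:120)
x$f <- x$col1 - x$col1 %% 5

# Per-group function that receives a data frame (hypothetical helper)
fun <- function(d) mean(d$col1)

# Roughly what by.data.frame does: group the row indices with tapply,
# then subset the data frame once per group before calling the function
manual <- tapply(seq_len(nrow(x)), x$f,
                 function(idx) fun(x[idx, , drop = FALSE]))

via_by <- by(x, x$f, fun)

# Same numbers either way; by() adds the "by" class and a "call" attribute
stopifnot(isTRUE(all.equal(as.vector(manual), as.vector(unclass(via_by)))))
print(manual)
```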


Published: 2023-10-28 13:57:28

Article link: https://www.elefans.com/category/jswz/34/1536824.html