In R, I am trying to do a very fast rolling mean of a large vector (up to 400k elements) using different window widths, then for each window width summarize the data by the maximum of each year. The example below will hopefully be clear. I have tried several approaches, and the fastest up to now seems to be using roll_mean from the package RcppRoll for the running mean, and aggregate for picking the maximum. Please note that memory requirement is a concern: the version below requires very little memory since it does one single rolling mean and aggregation at a time; this is preferred.
    library(RcppRoll)

    #Example data frame of 100k measurements from 2001 to 2014
    n <- 100000
    df <- data.frame(rawdata=rnorm(n),
                     year=sort(sample(2001:2014, size=n, replace=TRUE)))

    ww <- 1:120 #Vector of window widths

    #Preallocate the summary: one row per year, one column per window width
    dfsumm <- as.data.frame(matrix(nrow=14, ncol=121))
    dfsumm[,1] <- 2001:2014
    colnames(dfsumm) <- c("year", paste0("D=", ww))

    system.time(for (i in seq_along(ww)) {
      #Do the rolling mean for this ww
      df$tmp <- roll_mean(df$rawdata, ww[i], na.rm=TRUE, fill=NA)
      #Aggregate maxima for each year
      dfsumm[,i+1] <- aggregate(data=df, tmp ~ year, max)[,2]
    }) #28s on my machine

    dfsumm
This gives the desired output: a data.frame with 14 rows (years 2001 to 2014) and, next to the year column, 120 columns (one per window width) containing the maximum for each ww and for each year.
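To make the target concrete, here is a tiny hand-checkable illustration of a single pass of the loop: one window width of 3 applied to six made-up values. The names `toy` and `toy_max` are illustrative only and not part of the real data.

```r
library(RcppRoll)

# Six made-up values spread over two years
toy <- data.frame(rawdata = c(1, 2, 3, 4, 5, 6),
                  year    = c(2001, 2001, 2001, 2002, 2002, 2002))

# Centered rolling mean of width 3; the two edge positions are filled with NA
toy$tmp <- roll_mean(toy$rawdata, 3, na.rm = TRUE, fill = NA)
toy$tmp
# NA  2  3  4  5 NA

# Maximum of the rolling mean within each year
# (rows with NA are dropped by the formula interface of aggregate)
toy_max <- aggregate(tmp ~ year, data = toy, FUN = max)
toy_max$tmp
# 3 5
```

The real problem repeats this step for each of the 120 window widths and stores each result as one column of dfsumm.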
However, it still takes too long to compute (as I have to compute thousands of these). I have tried playing around with other options, namely dplyr and data.table, but I've been unable to find something faster due to my lack of knowledge of those packages.
Which would be the fastest way to do this, using a single core (the code is already parallelized elsewhere)?
Recommended answer
Memory management, i.e. allocation and copies, is killing you with your approach.
Here is a data.table approach, which assigns by reference:
    library(data.table)
    setDT(df)
    alloc.col(df, 200) #allocate sufficient columns

    #assign rolling means in a loop
    for (i in seq_along(ww)) set(df, j = paste0("D", i),
                                 value = roll_mean(df[["rawdata"]], ww[i], na.rm=TRUE, fill=NA))

    dfsumm <- df[, lapply(.SD, max, na.rm = TRUE), by = year] #aggregate
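Since the question lists memory as a concern, note that the loop above keeps one rolling-mean column per window width in the table at the same time. The following is a minimal sketch of a lower-memory variant, not the method from the answer: it assumes plain vectors, a result matrix preallocated once, and base R's `tapply` swapped in for `aggregate`. The toy sizes and the names `x`, `year`, `yrs`, and `res` are illustrative, and whether it is actually faster than the data.table version would need benchmarking on the real data.

```r
library(RcppRoll)

set.seed(42)                        # reproducible toy data
n    <- 2000
x    <- rnorm(n)
year <- sort(sample(2001:2014, size = n, replace = TRUE))
ww   <- 1:10                        # window widths (1:120 in the question)

yrs <- sort(unique(year))

# Preallocate the result once: one row per year, one column per window width
res <- matrix(NA_real_, nrow = length(yrs), ncol = length(ww),
              dimnames = list(as.character(yrs), paste0("D=", ww)))

for (i in seq_along(ww)) {
  # Only one rolling-mean vector exists at any time
  rm_i <- roll_mean(x, ww[i], na.rm = TRUE, fill = NA)
  # tapply returns the per-year maxima ordered by year, matching the rows of res
  res[, i] <- tapply(rm_i, year, max, na.rm = TRUE)
}

dim(res)
```

With a window width of 1 the rolling mean equals the raw data, so the first column of `res` should match the plain yearly maxima of `x`, which gives a quick sanity check.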