如何有效地使用Rprof在R？(How to efficiently use Rprof in R?)

我想知道是否可以以类似于matlab的Profiler的方式从R -Code获取一个配置文件。也就是说，要知道哪一行是特别慢的行数。

到目前为止，我所取得的成就令人满意。我使用Rprof让我成为一个配置文件。使用summaryRprof我得到以下内容：

$by.self self.time self.pct total.time total.pct [.data.frame 0.72 10.1 1.84 25.8 inherits 0.50 7.0 1.10 15.4 data.frame 0.48 6.7 4.86 68.3 unique.default 0.44 6.2 0.48 6.7 deparse 0.36 5.1 1.18 16.6 rbind 0.30 4.2 2.22 31.2 match 0.28 3.9 1.38 19.4 [<-.factor 0.28 3.9 0.56 7.9 levels 0.26 3.7 0.34 4.8 NextMethod 0.22 3.1 0.82 11.5 ...

和

$by.total total.time total.pct self.time self.pct data.frame 4.86 68.3 0.48 6.7 rbind 2.22 31.2 0.30 4.2 do.call 2.22 31.2 0.00 0.0 [ 1.98 27.8 0.16 2.2 [.data.frame 1.84 25.8 0.72 10.1 match 1.38 19.4 0.28 3.9 %in% 1.26 17.7 0.14 2.0 is.factor 1.20 16.9 0.10 1.4 deparse 1.18 16.6 0.36 5.1 ...

说实话，从这个输出我不知道我的瓶颈是因为（a）我经常使用data.frame ，（b）我从来没有使用eg， deparse 。此外， [ ？

所以我尝试了Hadley Wickham的专业profr ，但考虑到以下图表，它没有什么用处：

有更方便的方法来查看哪些行号和特定的函数调用缓慢吗？ 还是有一些我应该咨询的文献？

任何提示赞赏。

编辑1： 基于Hadley的评论，我将粘贴下面的脚本的代码和绘图的基本图形版本。但请注意，我的问题与此特定脚本无关。这只是我最近写的一个随机的脚本。 我正在寻找一个如何找到瓶颈并加快R code的一般方法。

数据（ x ）如下所示：

type word response N Classification classN Abstract ANGER bitter 1 3a 3a Abstract ANGER control 1 1a 1a Abstract ANGER father 1 3a 3a Abstract ANGER flushed 1 3a 3a Abstract ANGER fury 1 1c 1c Abstract ANGER hat 1 3a 3a Abstract ANGER help 1 3a 3a Abstract ANGER mad 13 3a 3a Abstract ANGER management 2 1a 1a ... until row 1700

脚本（简短的解释）是这样的：

Rprof("profile1.out") # A new dataset is produced with each line of x contained x$N times y <- vector('list',length(x[,1])) for (i in 1:length(x[,1])) { y[[i]] <- data.frame(rep(x[i,1],x[i,"N"]),rep(x[i,2],x[i,"N"]),rep(x[i,3],x[i,"N"]),rep(x[i,4],x[i,"N"]),rep(x[i,5],x[i,"N"]),rep(x[i,6],x[i,"N"])) } all <- do.call('rbind',y) colnames(all) <- colnames(x) # create a dataframe out of a word x class table table_all <- table(all$word,all$classN) dataf.all <- as.data.frame(table_all[,1:length(table_all[1,])]) dataf.all$words <- as.factor(rownames(dataf.all)) dataf.all$type <- "no" # get type of the word. words <- levels(dataf.all$words) for (i in 1:length(words)) { dataf.all$type[i] <- as.character(all[pmatch(words[i],all$word),"type"]) } dataf.all$type <- as.factor(dataf.all$type) dataf.all$typeN <- as.numeric(dataf.all$type) # aggregate response categories dataf.all$c1 <- apply(dataf.all[,c("1a","1b","1c","1d","1e","1f")],1,sum) dataf.all$c2 <- apply(dataf.all[,c("2a","2b","2c")],1,sum) dataf.all$c3 <- apply(dataf.all[,c("3a","3b")],1,sum) Rprof(NULL) library(profr) ggplot.profr(parse_rprof("profile1.out"))

最终数据如下所示：

1a 1b 1c 1d 1e 1f 2a 2b 2c 3a 3b pa words type typeN c1 c2 c3 pa 3 0 8 0 0 0 0 0 0 24 0 0 ANGER Abstract 1 11 0 24 0 6 0 4 0 1 0 0 11 0 13 0 0 ANXIETY Abstract 1 11 11 13 0 2 11 1 0 0 0 0 4 0 17 0 0 ATTITUDE Abstract 1 14 4 17 0 9 18 0 0 0 0 0 0 0 0 8 0 BARREL Concrete 2 27 0 8 0 0 1 18 0 0 0 0 4 0 12 0 0 BELIEF Abstract 1 19 4 12 0

基线图：

今天运行脚本也改变了ggplot2图表（基本上只是标签），看到这里。

I would like to know if it is possible to get a profile from R-Code in a way that is similar to matlab's Profiler. That is, to get to know which line numbers are the one's that are especially slow.

What I acchieved so far is somehow not satisfactory. I used Rprof to make me a profile file. Using summaryRprof I get something like the following:

$by.self self.time self.pct total.time total.pct [.data.frame 0.72 10.1 1.84 25.8 inherits 0.50 7.0 1.10 15.4 data.frame 0.48 6.7 4.86 68.3 unique.default 0.44 6.2 0.48 6.7 deparse 0.36 5.1 1.18 16.6 rbind 0.30 4.2 2.22 31.2 match 0.28 3.9 1.38 19.4 [<-.factor 0.28 3.9 0.56 7.9 levels 0.26 3.7 0.34 4.8 NextMethod 0.22 3.1 0.82 11.5 ...

and

$by.total total.time total.pct self.time self.pct data.frame 4.86 68.3 0.48 6.7 rbind 2.22 31.2 0.30 4.2 do.call 2.22 31.2 0.00 0.0 [ 1.98 27.8 0.16 2.2 [.data.frame 1.84 25.8 0.72 10.1 match 1.38 19.4 0.28 3.9 %in% 1.26 17.7 0.14 2.0 is.factor 1.20 16.9 0.10 1.4 deparse 1.18 16.6 0.36 5.1 ...

To be honest, from this output I don't get where my bottlenecks are because (a) I use data.frame pretty often and (b) I never use e.g., deparse. Furthermore, what is [?

So I tried Hadley Wickham's profr, but it was not any more useful considering the following graph:

Is there a more convenient way to see which line numbers and particular function calls are slow? Or, is there some literature that I should consult?

Any hints appreciated.

EDIT 1: Based on Hadley's comment I will paste the code of my script below and the base graph version of the plot. But note, that my question is not related to this specific script. It is just a random script that I recently wrote. I am looking for a general way of how to find bottlenecks and speed up R-code.

The data (x) looks like this:

type word response N Classification classN Abstract ANGER bitter 1 3a 3a Abstract ANGER control 1 1a 1a Abstract ANGER father 1 3a 3a Abstract ANGER flushed 1 3a 3a Abstract ANGER fury 1 1c 1c Abstract ANGER hat 1 3a 3a Abstract ANGER help 1 3a 3a Abstract ANGER mad 13 3a 3a Abstract ANGER management 2 1a 1a ... until row 1700

The script (with short explanations) is this:

Rprof("profile1.out") # A new dataset is produced with each line of x contained x$N times y <- vector('list',length(x[,1])) for (i in 1:length(x[,1])) { y[[i]] <- data.frame(rep(x[i,1],x[i,"N"]),rep(x[i,2],x[i,"N"]),rep(x[i,3],x[i,"N"]),rep(x[i,4],x[i,"N"]),rep(x[i,5],x[i,"N"]),rep(x[i,6],x[i,"N"])) } all <- do.call('rbind',y) colnames(all) <- colnames(x) # create a dataframe out of a word x class table table_all <- table(all$word,all$classN) dataf.all <- as.data.frame(table_all[,1:length(table_all[1,])]) dataf.all$words <- as.factor(rownames(dataf.all)) dataf.all$type <- "no" # get type of the word. words <- levels(dataf.all$words) for (i in 1:length(words)) { dataf.all$type[i] <- as.character(all[pmatch(words[i],all$word),"type"]) } dataf.all$type <- as.factor(dataf.all$type) dataf.all$typeN <- as.numeric(dataf.all$type) # aggregate response categories dataf.all$c1 <- apply(dataf.all[,c("1a","1b","1c","1d","1e","1f")],1,sum) dataf.all$c2 <- apply(dataf.all[,c("2a","2b","2c")],1,sum) dataf.all$c3 <- apply(dataf.all[,c("3a","3b")],1,sum) Rprof(NULL) library(profr) ggplot.profr(parse_rprof("profile1.out"))

Final data looks like this:

1a 1b 1c 1d 1e 1f 2a 2b 2c 3a 3b pa words type typeN c1 c2 c3 pa 3 0 8 0 0 0 0 0 0 24 0 0 ANGER Abstract 1 11 0 24 0 6 0 4 0 1 0 0 11 0 13 0 0 ANXIETY Abstract 1 11 11 13 0 2 11 1 0 0 0 0 4 0 17 0 0 ATTITUDE Abstract 1 14 4 17 0 9 18 0 0 0 0 0 0 0 0 8 0 BARREL Concrete 2 27 0 8 0 0 1 18 0 0 0 0 4 0 12 0 0 BELIEF Abstract 1 19 4 12 0

The base graph plot:

Running the script today also changed the ggplot2 graph a little (basically only the labels), see here.

最满意答案

提醒读者昨天的突发新闻（ R 3.0.0终于出来了）可能已经注意到与这个问题直接相关的有趣的事情：

通过Rprof（）进行分析可以选择在语句级别记录信息，而不仅仅是函数级别。

事实上，这个新功能回答了我的问题，我将展示如何。

比方说，我们想比较在计算汇总统计量（如平均值）时，向量化和预分配是否比较好的旧循环和数据增量建立更好。相对愚蠢的代码如下：

# create big data frame: n <- 1000 x <- data.frame(group = sample(letters[1:4], n, replace=TRUE), condition = sample(LETTERS[1:10], n, replace = TRUE), data = rnorm(n)) # reasonable operations: marginal.means.1 <- aggregate(data ~ group + condition, data = x, FUN=mean) # unreasonable operations: marginal.means.2 <- marginal.means.1[NULL,] row.counter <- 1 for (condition in levels(x$condition)) { for (group in levels(x$group)) { tmp.value <- 0 tmp.length <- 0 for (c in 1:nrow(x)) { if ((x[c,"group"] == group) & (x[c,"condition"] == condition)) { tmp.value <- tmp.value + x[c,"data"] tmp.length <- tmp.length + 1 } } marginal.means.2[row.counter,"group"] <- group marginal.means.2[row.counter,"condition"] <- condition marginal.means.2[row.counter,"data"] <- tmp.value / tmp.length row.counter <- row.counter + 1 } } # does it produce the same results? all.equal(marginal.means.1, marginal.means.2)

要使用这个代码与Rprof ，我们需要parse它。也就是说，它需要保存在文件中，然后从那里调用。因此，我将其上传到pastebin ，但它与本地文件的工作方式完全相同。

现在我们

只需创建一个配置文件，并指出我们要保存行号，使用令人难以置信的组合eval(parse(..., keep.source = TRUE)) （似乎臭名昭着的fortune(106)不适用于此，因为我还没有找到另一种方式）停止分析，并指出我们要根据行号输出。

代码是：

Rprof("profile1.out", line.profiling=TRUE) eval(parse(file = "http://pastebin.com/download.php?i=KjdkSVZq", keep.source=TRUE)) Rprof(NULL) summaryRprof("profile1.out", lines = "show")

这使：

$by.self self.time self.pct total.time total.pct download.php?i=KjdkSVZq#17 8.04 64.11 8.04 64.11 <no location> 4.38 34.93 4.38 34.93 download.php?i=KjdkSVZq#16 0.06 0.48 0.06 0.48 download.php?i=KjdkSVZq#18 0.02 0.16 0.02 0.16 download.php?i=KjdkSVZq#23 0.02 0.16 0.02 0.16 download.php?i=KjdkSVZq#6 0.02 0.16 0.02 0.16 $by.total total.time total.pct self.time self.pct download.php?i=KjdkSVZq#17 8.04 64.11 8.04 64.11 <no location> 4.38 34.93 4.38 34.93 download.php?i=KjdkSVZq#16 0.06 0.48 0.06 0.48 download.php?i=KjdkSVZq#18 0.02 0.16 0.02 0.16 download.php?i=KjdkSVZq#23 0.02 0.16 0.02 0.16 download.php?i=KjdkSVZq#6 0.02 0.16 0.02 0.16 $by.line self.time self.pct total.time total.pct <no location> 4.38 34.93 4.38 34.93 download.php?i=KjdkSVZq#6 0.02 0.16 0.02 0.16 download.php?i=KjdkSVZq#16 0.06 0.48 0.06 0.48 download.php?i=KjdkSVZq#17 8.04 64.11 8.04 64.11 download.php?i=KjdkSVZq#18 0.02 0.16 0.02 0.16 download.php?i=KjdkSVZq#23 0.02 0.16 0.02 0.16 $sample.interval [1] 0.02 $sampling.time [1] 12.54

检查源代码告诉我们，有问题的行（＃17）确实是for-loop中的愚蠢的if -statement。与基本没有时间使用矢量化代码计算相同（行＃6）。

我没有尝试任何图形输出，但我已经很深刻的是我到目前为止。

Alert readers of yesterdays breaking news (R 3.0.0 is finally out) may have noticed something interesting that is directly relevant to this question:

Profiling via Rprof() now optionally records information at the statement level, not just the function level.

And indeed, this new feature answers my question and I will show how.

Let's say, we want to compare whether vectorizing and pre-allocating are really better than good old for-loops and incremental building of data in calculating a summary statistic such as the mean. The, relatively stupid, code is the following:

# create big data frame: n <- 1000 x <- data.frame(group = sample(letters[1:4], n, replace=TRUE), condition = sample(LETTERS[1:10], n, replace = TRUE), data = rnorm(n)) # reasonable operations: marginal.means.1 <- aggregate(data ~ group + condition, data = x, FUN=mean) # unreasonable operations: marginal.means.2 <- marginal.means.1[NULL,] row.counter <- 1 for (condition in levels(x$condition)) { for (group in levels(x$group)) { tmp.value <- 0 tmp.length <- 0 for (c in 1:nrow(x)) { if ((x[c,"group"] == group) & (x[c,"condition"] == condition)) { tmp.value <- tmp.value + x[c,"data"] tmp.length <- tmp.length + 1 } } marginal.means.2[row.counter,"group"] <- group marginal.means.2[row.counter,"condition"] <- condition marginal.means.2[row.counter,"data"] <- tmp.value / tmp.length row.counter <- row.counter + 1 } } # does it produce the same results? all.equal(marginal.means.1, marginal.means.2)

To use this code with Rprof, we need to parse it. That is, it needs to be saved in a file and then called from there. Hence, I uploaded it to pastebin, but it works exactly the same with local files.

Now, we

simply create a profile file and indicate that we want to save the line number, source the code with the incredible combination eval(parse(..., keep.source = TRUE)) (seemingly the infamous fortune(106) does not apply here, as I haven't found another way) stop the profiling and indicate that we want the output based on the line numbers.

The code is:

Rprof("profile1.out", line.profiling=TRUE) eval(parse(file = "http://pastebin.com/download.php?i=KjdkSVZq", keep.source=TRUE)) Rprof(NULL) summaryRprof("profile1.out", lines = "show")

Which gives:

$by.self self.time self.pct total.time total.pct download.php?i=KjdkSVZq#17 8.04 64.11 8.04 64.11 <no location> 4.38 34.93 4.38 34.93 download.php?i=KjdkSVZq#16 0.06 0.48 0.06 0.48 download.php?i=KjdkSVZq#18 0.02 0.16 0.02 0.16 download.php?i=KjdkSVZq#23 0.02 0.16 0.02 0.16 download.php?i=KjdkSVZq#6 0.02 0.16 0.02 0.16 $by.total total.time total.pct self.time self.pct download.php?i=KjdkSVZq#17 8.04 64.11 8.04 64.11 <no location> 4.38 34.93 4.38 34.93 download.php?i=KjdkSVZq#16 0.06 0.48 0.06 0.48 download.php?i=KjdkSVZq#18 0.02 0.16 0.02 0.16 download.php?i=KjdkSVZq#23 0.02 0.16 0.02 0.16 download.php?i=KjdkSVZq#6 0.02 0.16 0.02 0.16 $by.line self.time self.pct total.time total.pct <no location> 4.38 34.93 4.38 34.93 download.php?i=KjdkSVZq#6 0.02 0.16 0.02 0.16 download.php?i=KjdkSVZq#16 0.06 0.48 0.06 0.48 download.php?i=KjdkSVZq#17 8.04 64.11 8.04 64.11 download.php?i=KjdkSVZq#18 0.02 0.16 0.02 0.16 download.php?i=KjdkSVZq#23 0.02 0.16 0.02 0.16 $sample.interval [1] 0.02 $sampling.time [1] 12.54

Checking the source code tells us that the problematic line (#17) is indeed the stupid if-statement in the for-loop. Compared with basically no time for calculating the same using vectorized code (line #6).

I haven't tried it with any graphical output, but I am already very impressed by what I got so far.

更多推荐