Improving ggplot2 Performance

Problem description

The ggplot2 package is easily the best plotting system I have ever worked with, except that its performance is not really good for larger datasets (~50k points). I'm looking into providing web analyses through Shiny, using ggplot2 as the plotting backend, but I'm not really happy with the performance, especially in contrast with base graphics. My question is whether there are any concrete ways to improve this performance.

The starting point is the following code example:

library(ggplot2)

n = 86400  # a day in seconds
dat = data.frame(id = 1:n, val = sort(runif(n)))

dev.new()
gg_base  = ggplot(dat, aes(x = id, y = val))
gg_point = gg_base + geom_point()
gg_line  = gg_base + geom_line()
gg_both  = gg_base + geom_point() + geom_line()

benchplot(gg_point)
benchplot(gg_line)
benchplot(gg_both)

system.time(plot(dat))
system.time(plot(dat, type = 'l'))

I get the following timings on my MacPro retina:

> benchplot(gg_point)
       step user.self sys.self elapsed
1 construct     0.000    0.000   0.000
2     build     0.321    0.078   0.398
3    render     0.271    0.088   0.359
4      draw     2.013    0.018   2.218
5     TOTAL     2.605    0.184   2.975
> benchplot(gg_line)
       step user.self sys.self elapsed
1 construct     0.000    0.000   0.000
2     build     0.330    0.073   0.403
3    render     0.622    0.095   0.717
4      draw     2.078    0.009   2.266
5     TOTAL     3.030    0.177   3.386
> benchplot(gg_both)
       step user.self sys.self elapsed
1 construct     0.000    0.000   0.000
2     build     0.602    0.155   0.757
3    render     0.866    0.186   1.051
4      draw     4.020    0.030   4.238
5     TOTAL     5.488    0.371   6.046
> system.time(plot(dat))
   user  system elapsed
  1.133   0.004   1.138
# Note that the timing below depended heavily on whether or not the graphics
# device was in view. Not being in view made performance much, much better.
> system.time(plot(dat, type = 'l'))
   user  system elapsed
  1.230   0.003   1.233

Some more info on my setup:

> sessionInfo()
R version 2.15.3 (2013-03-01)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ggplot2_0.9.3.1

loaded via a namespace (and not attached):
 [1] MASS_7.3-23        RColorBrewer_1.0-5 colorspace_1.2-1   dichromat_2.0-0
 [5] digest_0.6.3       grid_2.15.3        gtable_0.1.2       labeling_0.1
 [9] munsell_0.4        plyr_1.8           proto_0.3-10       reshape2_1.2.2
[13] scales_0.2.3       stringr_0.6.2

Solution

Hadley gave a cool talk about his new packages dplyr and ggvis at useR! 2013, but he can probably tell you more about that himself.

I'm not sure what your application design looks like, but I often do in-database pre-processing before feeding the data to R. For example, if you are plotting a time series, there is really no need to show every second of the day on the x axis. Instead you might want to aggregate and get the min/max/mean over, say, one- or five-minute intervals.
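To make the idea concrete, here is a minimal sketch of the same aggregation done purely in R (not from the original answer, which does it in SQL). It collapses the 86,400-point series from the question to one min/max/mean triple per minute, so ggplot2 only has to draw about 1,440 rows:

library(ggplot2)

# Same data as in the question: one observation per second for a day.
n   <- 86400
dat <- data.frame(id = 1:n, val = sort(runif(n)))

# Assign each second to a minute bucket, then collapse each bucket
# to its min/max/mean.
dat$minute <- (dat$id - 1) %/% 60
agg <- do.call(rbind, lapply(split(dat, dat$minute), function(d) {
  data.frame(minute = d$minute[1],
             ymin   = min(d$val),
             ymax   = max(d$val),
             yavg   = mean(d$val))
}))

# ~1440 rows reach ggplot2 instead of 86400, so the draw step is cheap.
ggplot(agg, aes(x = minute)) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.3) +
  geom_line(aes(y = yavg))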

Below is an example of a function I wrote years ago that does something like that in SQL. This particular example uses the modulo operator because times were stored as epoch milliseconds. But if the data in SQL are properly stored as date/datetime structures, SQL has some more elegant native methods to aggregate by time periods.

#' @param table name of the table
#' @param start start time/date
#' @param end end time/date
#' @param aggregate one of "days", "hours", "mins" or "weeks"
#' @param group grouping variable
#' @param column name of the target column (y axis)
#' @export
minmaxdata <- function(table, start, end, aggregate=c("days", "hours", "mins", "weeks"), group=1, column){

  #dates (as epoch milliseconds)
  start <- round(unclass(as.POSIXct(start))*1000);
  end   <- round(unclass(as.POSIXct(end))*1000);

  #must aggregate
  aggregate <- match.arg(aggregate);

  #calculate modulus
  mod <- switch(aggregate,
    "mins"  = 1000*60,
    "hours" = 1000*60*60,
    "days"  = 1000*60*60*24,
    "weeks" = 1000*60*60*24*7,
    stop("invalid aggregate value")
  );

  #we need to add the time difference between gmt and pst to make modulo work
  delta <- 1000 * 60 * 60 * (24 - unclass(as.POSIXct(format(Sys.time(), tz="GMT")) - Sys.time()));

  #form query
  query <- paste("SELECT", group, "AS grouping, AVG(", column, ") AS yavg, MAX(", column,
                 ") AS ymax, MIN(", column, ") AS ymin, ((CMilliseconds_g +", delta,
                 ") DIV", mod, ") AS timediv FROM", table, "WHERE CMilliseconds_g BETWEEN",
                 start, "AND", end, "GROUP BY", group, ", timediv;")
  mydata <- getquery(query);

  #convert the integer time buckets back to timestamps
  mydata$time <- structure(mod*mydata[["timediv"]]/1000 - delta/1000, class=c("POSIXct", "POSIXt"));
  mydata$grouping <- as.factor(mydata$grouping)

  #round timestamps
  if(aggregate %in% c("mins", "hours")){
    mydata$time <- round(mydata$time, aggregate)
  } else {
    mydata$time <- as.Date(mydata$time);
  }

  #return
  return(mydata)
}
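For completeness, a hypothetical call to this function and a plot of its result might look as follows. This is only a sketch: the table and column names are invented, and getquery() (undefined in the snippet above) is assumed to be a thin wrapper around your database driver.

# Hypothetical usage. Assumes getquery() runs SQL against an open
# connection and returns a data.frame, e.g. something like:
#   getquery <- function(sql) DBI::dbGetQuery(con, sql)
# and that a table "sensor_log" with a value column "temperature" exists.
mydata <- minmaxdata(table     = "sensor_log",
                     start     = "2013-06-01 00:00:00",
                     end       = "2013-06-02 00:00:00",
                     aggregate = "mins",
                     group     = 1,
                     column    = "temperature")

# Plot the per-minute min/max band plus the mean; as with the in-R sketch
# above, only ~1440 rows per day reach ggplot2.
library(ggplot2)
ggplot(mydata, aes(x = time, group = grouping)) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.3) +
  geom_line(aes(y = yavg))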
