Improving ggplot2 Performance

Problem description

The ggplot2 package is easily the best plotting system I have ever worked with, except that its performance is not really good for larger datasets (~50k points). I'm looking into providing web analyses through Shiny, using ggplot2 as the plotting backend, but I'm not really happy with the performance, especially in contrast with base graphics. My question is whether there are any concrete ways to improve this performance.

The starting point is the following code example:

library(ggplot2)

n = 86400  # a day in seconds
dat = data.frame(id = 1:n, val = sort(runif(n)))

dev.new()
gg_base  = ggplot(dat, aes(x = id, y = val))
gg_point = gg_base + geom_point()
gg_line  = gg_base + geom_line()
gg_both  = gg_base + geom_point() + geom_line()

benchplot(gg_point)
benchplot(gg_line)
benchplot(gg_both)

system.time(plot(dat))
system.time(plot(dat, type = 'l'))

I get the following timings on my MacPro retina:

> benchplot(gg_point)
       step user.self sys.self elapsed
1 construct     0.000    0.000   0.000
2     build     0.321    0.078   0.398
3    render     0.271    0.088   0.359
4      draw     2.013    0.018   2.218
5     TOTAL     2.605    0.184   2.975
> benchplot(gg_line)
       step user.self sys.self elapsed
1 construct     0.000    0.000   0.000
2     build     0.330    0.073   0.403
3    render     0.622    0.095   0.717
4      draw     2.078    0.009   2.266
5     TOTAL     3.030    0.177   3.386
> benchplot(gg_both)
       step user.self sys.self elapsed
1 construct     0.000    0.000   0.000
2     build     0.602    0.155   0.757
3    render     0.866    0.186   1.051
4      draw     4.020    0.030   4.238
5     TOTAL     5.488    0.371   6.046
> system.time(plot(dat))
   user  system elapsed
  1.133   0.004   1.138
# Note that the timing below depended heavily on whether or not the graphics
# device was in view. Not being in view made performance much, much better.
> system.time(plot(dat, type = 'l'))
   user  system elapsed
  1.230   0.003   1.233

Some more info on my setup:

> sessionInfo()
R version 2.15.3 (2013-03-01)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ggplot2_0.9.3.1

loaded via a namespace (and not attached):
 [1] MASS_7.3-23        RColorBrewer_1.0-5 colorspace_1.2-1   dichromat_2.0-0
 [5] digest_0.6.3       grid_2.15.3        gtable_0.1.2       labeling_0.1
 [9] munsell_0.4        plyr_1.8           proto_0.3-10       reshape2_1.2.2
[13] scales_0.2.3       stringr_0.6.2

Solution

Hadley gave a cool talk about his new packages dplyr and ggvis at useR! 2013, but he can probably tell you more about that himself.

I'm not sure what your application design looks like, but I often do in-database pre-processing before feeding the data to R. For example, if you are plotting a time series, there is really no need to show every second of the day on the x axis. Instead you might want to aggregate and get the min/max/mean over, say, one- or five-minute intervals.
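To make the idea concrete, here is a minimal sketch of the same aggregation done purely in R (not from the original answer, which does it in SQL). It collapses the 86,400-point series from the question to one min/max/mean triple per minute, so ggplot2 only has to draw about 1,440 rows:

library(ggplot2)

# Same data as in the question: one observation per second for a day.
n   <- 86400
dat <- data.frame(id = 1:n, val = sort(runif(n)))

# Assign each second to a minute bucket, then collapse each bucket
# to its min/max/mean.
dat$minute <- (dat$id - 1) %/% 60
agg <- do.call(rbind, lapply(split(dat, dat$minute), function(d) {
  data.frame(minute = d$minute[1],
             ymin   = min(d$val),
             ymax   = max(d$val),
             yavg   = mean(d$val))
}))

# ~1440 rows reach ggplot2 instead of 86400, so the draw step is cheap.
ggplot(agg, aes(x = minute)) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.3) +
  geom_line(aes(y = yavg))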

Below is an example of a function I wrote years ago that does something like that in SQL. This particular example uses the modulo operator because times were stored as epoch milliseconds. But if the data in SQL are properly stored as date/datetime structures, SQL has some more elegant native methods to aggregate by time periods.

#' @param table name of the table
#' @param start start time/date
#' @param end end time/date
#' @param aggregate one of "days", "hours", "mins" or "weeks"
#' @param group grouping variable
#' @param column name of the target column (y axis)
#' @export
minmaxdata <- function(table, start, end, aggregate=c("days", "hours", "mins", "weeks"), group=1, column){

  #dates (as epoch milliseconds)
  start <- round(unclass(as.POSIXct(start))*1000);
  end   <- round(unclass(as.POSIXct(end))*1000);

  #must aggregate
  aggregate <- match.arg(aggregate);

  #calculate modulus
  mod <- switch(aggregate,
    "mins"  = 1000*60,
    "hours" = 1000*60*60,
    "days"  = 1000*60*60*24,
    "weeks" = 1000*60*60*24*7,
    stop("invalid aggregate value")
  );

  #we need to add the time difference between gmt and pst to make modulo work
  delta <- 1000 * 60 * 60 * (24 - unclass(as.POSIXct(format(Sys.time(), tz="GMT")) - Sys.time()));

  #form query
  query <- paste("SELECT", group, "AS grouping, AVG(", column, ") AS yavg, MAX(", column,
                 ") AS ymax, MIN(", column, ") AS ymin, ((CMilliseconds_g +", delta,
                 ") DIV", mod, ") AS timediv FROM", table, "WHERE CMilliseconds_g BETWEEN",
                 start, "AND", end, "GROUP BY", group, ", timediv;")
  mydata <- getquery(query);

  #convert the integer time buckets back to timestamps
  mydata$time <- structure(mod*mydata[["timediv"]]/1000 - delta/1000, class=c("POSIXct", "POSIXt"));
  mydata$grouping <- as.factor(mydata$grouping)

  #round timestamps
  if(aggregate %in% c("mins", "hours")){
    mydata$time <- round(mydata$time, aggregate)
  } else {
    mydata$time <- as.Date(mydata$time);
  }

  #return
  return(mydata)
}
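For completeness, a hypothetical call to this function and a plot of its result might look as follows. This is only a sketch: the table and column names are invented, and getquery() (undefined in the snippet above) is assumed to be a thin wrapper around your database driver.

# Hypothetical usage. Assumes getquery() runs SQL against an open
# connection and returns a data.frame, e.g. something like:
#   getquery <- function(sql) DBI::dbGetQuery(con, sql)
# and that a table "sensor_log" with a value column "temperature" exists.
mydata <- minmaxdata(table     = "sensor_log",
                     start     = "2013-06-01 00:00:00",
                     end       = "2013-06-02 00:00:00",
                     aggregate = "mins",
                     group     = 1,
                     column    = "temperature")

# Plot the per-minute min/max band plus the mean; as with the in-R sketch
# above, only ~1440 rows per day reach ggplot2.
library(ggplot2)
ggplot(mydata, aes(x = time, group = grouping)) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.3) +
  geom_line(aes(y = yavg))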
