How to fill NA with the median?

Updated: 2024-10-25 18:23:00

Problem description

Sample data:

    set.seed(1)
    df <- data.frame(years=sort(rep(2005:2010, 12)),
                     months=1:12,
                     value=c(rnorm(60), rep(NA, 12)))
    head(df)
      years months      value
    1  2005      1 -0.6264538
    2  2005      2  0.1836433
    3  2005      3 -0.8356286
    4  2005      4  1.5952808
    5  2005      5  0.3295078
    6  2005      6 -0.8204684

How can I replace the NA values in df$value with the median for the same month? Each missing "value" must be filled with the median of all previous values for that month. For example, if the current month is May, "value" must contain the median of all previous May values.
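To make the requirement concrete, here is a small sketch that computes by hand the value the May 2010 row should receive. It rebuilds the sample data from the question; the variable name `may_observed` is introduced here only for illustration.

```r
# Rebuild the sample data from the question
set.seed(1)
df <- data.frame(years = sort(rep(2005:2010, 12)),
                 months = 1:12,
                 value = c(rnorm(60), rep(NA, 12)))

# All non-missing May values (2005 to 2009); the 2010 row is NA
may_observed <- df$value[df$months == 5 & !is.na(df$value)]

# This is the value that should replace the NA for May 2010
may_median <- median(may_observed)
may_median
```

Any correct solution should put this number into the row where years is 2010 and months is 5.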

Recommended answer

Alternatively, use ave:

    df <- data.frame(years=sort(rep(2005:2010, 12)),
                     months=1:12,
                     value=c(rnorm(60), rep(NA, 12)))

    # ave() returns, for every row, the median of the non-missing values in that
    # row's month; only the positions that are NA are overwritten.
    df$value[is.na(df$value)] <- with(df, ave(value, months,
        FUN = function(x) median(x, na.rm = TRUE)))[is.na(df$value)]
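Note that ave computes the median over all non-missing values of a month, which coincides with "all previous values" here only because every NA sits in the final year. If missing values could appear in the middle of the series, something like the following sketch might be closer to the literal wording of the question. The helper `fill_with_previous_median` is hypothetical and not part of the original answer, and it assumes that only originally observed values (not already filled ones) should enter the median.

```r
# Hypothetical helper: fill each NA with the median of the observed values
# from strictly earlier rows of the same month.
fill_with_previous_median <- function(df) {
  df <- df[order(df$years, df$months), ]   # make sure rows are chronological
  observed <- df$value                     # untouched copy of the original values
  for (i in which(is.na(observed))) {
    before  <- seq_len(i - 1)
    earlier <- observed[before][df$months[before] == df$months[i]]
    if (any(!is.na(earlier))) {
      df$value[i] <- median(earlier, na.rm = TRUE)
    }                                      # otherwise the NA is left in place
  }
  df
}

# Tiny example: two months over three years, the last year missing
small <- data.frame(years  = rep(2005:2007, each = 2),
                    months = rep(1:2, times = 3),
                    value  = c(1, 10, 3, 20, NA, NA))
filled <- fill_with_previous_median(small)
```

The loop is slower than ave on large data, so it is worth using only when the "previous values only" distinction actually matters.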

Since there are so many answers, let's see which is fastest.

    library(plyr)
    library(data.table)
    library(rbenchmark)  # provides benchmark()

    # plyr variant: compute per-month medians once, then look them up with match()
    plyr2 <- function(df){
      medDF <- ddply(df, .(months), summarize, median=median(value, na.rm=TRUE))
      df$value[is.na(df$value)] <- medDF$median[match(df$months, medDF$months)][is.na(df$value)]
      df
    }

    DT <- data.table(df)
    setkey(DT, months)

    benchmark(
      ave = df$value[is.na(df$value)] <- with(df, ave(value, months,
              FUN = function(x) median(x, na.rm = TRUE)))[is.na(df$value)],
      tapply = df$value[61:72] <- with(df, tapply(value, months, median, na.rm=TRUE)),
      sapply = df[61:72, 3] <- sapply(split(df[1:60, 3], df[1:60, 2]), median),
      plyr = ddply(df, .(months), transform,
              value=ifelse(is.na(value), median(value, na.rm=TRUE), value)),
      plyr2 = plyr2(df),
      data.table = DT[, value := ifelse(is.na(value), median(value, na.rm=TRUE), value),
              by=months],
      order = "elapsed")

            test replications elapsed relative user.self sys.self user.child sys.child
    3     sapply          100   0.209 1.000000     0.196    0.000          0         0
    1        ave          100   0.260 1.244019     0.244    0.000          0         0
    6 data.table          100   0.271 1.296651     0.264    0.000          0         0
    2     tapply          100   0.271 1.296651     0.256    0.000          0         0
    5      plyr2          100   1.675 8.014354     1.612    0.004          0         0
    4       plyr          100   2.075 9.928230     2.004    0.000          0         0

I would have bet that data.table was the fastest.

[ Matthew Dowle ] The task being timed here takes at most 0.02 seconds (2.075/100). data.table considers that insignificant. Try setting replications to 1 and increasing the data size instead. Alternatively, timing the fastest of 3 runs is a common rule of thumb. These links discuss this in more detail:

  • Evidence that data.table isn't always fastest
  • Benchmarks in Averaging column values for specific sections of data corresponding to other column values
  • London R presentation, June 2012 (slide 21 headed "Other")
  • A transform by group benchmark in an extreme case
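The advice above (larger data, a single replication, fastest of 3 runs) could be sketched as follows, using only base R so that it runs without extra packages. The data size `n_years`, the number of injected NAs, and the helper `fill_ave` are assumptions made for illustration; the same pattern could wrap the data.table or plyr variants.

```r
set.seed(1)
n_years <- 20000   # assumed size; raise it until one run takes well over 0.02 s
big <- data.frame(years  = rep(seq_len(n_years), each = 12),
                  months = 1:12,
                  value  = rnorm(n_years * 12))
big$value[sample(nrow(big), 1000)] <- NA   # scatter some missing values

# The ave approach from the answer, wrapped in a function so that each
# timed run starts from the same unfilled data
fill_ave <- function(d) {
  na <- is.na(d$value)
  d$value[na] <- ave(d$value, d$months,
                     FUN = function(x) median(x, na.rm = TRUE))[na]
  d
}

# Time three single runs and keep the fastest
timings <- sapply(1:3, function(i) system.time(fill_ave(big))[["elapsed"]])
fastest <- min(timings)
fastest
```

Wrapping the fill in a function also avoids a pitfall of timing an in-place assignment repeatedly: after the first replication the NAs are already filled, so later replications measure a different workload.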

Published: 2023-07-10 08:24:16
Link: https://www.elefans.com/category/jswz/34/1089828.html