How to replace NA values in a data.table with na.spline

Updated: 2024-10-26 12:21:48

Problem description

I'm trying to prepare some demographic data retrieved from Eurostat for further processing, which includes replacing any missing values with approximated ones.

First I was using data.frames only, but then I got convinced that data.tables might offer some advantages over regular data.frames, so I migrated to data.tables.

One thing I've observed while doing so was getting different results when using "na.spline" in combination with "apply" versus "na.spline" as part of the data.table.

#1 source data

library(data.table)
library(zoo)  # provides na.spline

(dt0 <- data.table(
    "age,sex,geo\\time" = c("TOTAL,F,AD", "TOTAL,F,AL", "TOTAL,F,AM", "TOTAL,F,AT", "TOTAL,F,AZ"),
    "2014" = c(NA, NA, NA, 4351253, NA),
    "2013" = c(37408, NA, NA, 4328238, 4707690),
    "2012" = c(38252, NA, 1684000, 4309977, 4651601),
    "2011" = c(38252, 1409931, 1679066, 4296293, 4594023),
    "2010" = c(40296, NA, 1673656, 4285442, 4542083)
))

Which generates

   age,sex,geo\\time    2014    2013    2012    2011    2010
1:        TOTAL,F,AD      NA   37408   38252   38252   40296
2:        TOTAL,F,AL      NA      NA      NA 1409931      NA
3:        TOTAL,F,AM      NA      NA 1684000 1679066 1673656
4:        TOTAL,F,AT 4351253 4328238 4309977 4296293 4285442
5:        TOTAL,F,AZ      NA 4707690 4651601 4594023 4542083

Split into "row headings"...

(dt0a <- dt0[, 1, with=FALSE])
(cn0a <- colnames(dt0a))

...and population data...

(dt0b <- dt0[, 2:ncol(dt0), with=FALSE])
(cn0b <- colnames(dt0b))

#2 replace NAs by "na.spline" & "apply"

(dt1b <- data.table(t(apply(dt0b, 1, na.spline, na.rm=FALSE))))
(setnames(dt1b, cn0b))
(dt1 <- cbind(dt0a, dt1b))

Which results in...

   age,sex,geo\\time    2014    2013    2012    2011    2010
1:        TOTAL,F,AD   32832   37408   38252   38252   40296
2:        TOTAL,F,AL 1409931 1409931 1409931 1409931 1409931
3:        TOTAL,F,AM 1692440 1688458 1684000 1679066 1673656
4:        TOTAL,F,AT 4351253 4328238 4309977 4296293 4285442
5:        TOTAL,F,AZ 4755163 4707690 4651601 4594023 4542083

#3 replace NAs within "data.table"

(dt2b <- dt0b[, lapply(.SD, na.spline, na.rm=FALSE)])
(dt2 <- cbind(dt0a, dt2b))

Et voila...

   age,sex,geo\\time    2014    2013      2012    2011      2010
1:        TOTAL,F,AD 4351253   37408   38252.0   38252   40296.0
2:        TOTAL,F,AL 4351253 1993097 -611513.8 1409931 -629363.2
3:        TOTAL,F,AM 4351253 3423374 1684000.0 1679066 1673656.0
4:        TOTAL,F,AT 4351253 4328238 4309977.0 4296293 4285442.0
5:        TOTAL,F,AZ 4351253 4707690 4651601.0 4594023 4542083.0

#4 compare results

(identical(dt1, dt2))

Given the above, this is hardly a surprise...

[1] FALSE

(The values that approach #2 computes as replacements for the NAs are the ones I'm after, but I'd like to generate them via approach #3.)

The reason for pursuing the "data.table" route (approach #3) is performance: as various posts have pointed out, "apply" carries out a matrix operation, which takes considerably longer than a corresponding approach that uses "data.table" only.

Being pretty new to R, I reckon I've done something fundamentally wrong; I just haven't got the faintest idea what it might be.

Any help pointing me in the right direction is more than appreciated!

-Sil68

Solution

Using a matrix. Using a matrix operation on a matrix is not slow:

mat <- t(as.matrix(dt0[, -1]))
colnames(mat) <- dt0[[1]]
mat[] <- na.spline(mat, na.rm=FALSE)

which gives

     TOTAL,F,AD TOTAL,F,AL TOTAL,F,AM TOTAL,F,AT TOTAL,F,AZ
2014      32832    1409931    1692440    4351253    4755163
2013      37408    1409931    1688458    4328238    4707690
2012      38252    1409931    1684000    4309977    4651601
2011      38252    1409931    1679066    4296293    4594023
2010      40296    1409931    1673656    4285442    4542083
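If the result is wanted back in the OP's original layout (one row per country, one column per year), the filled matrix can be transposed again and rejoined with the label column. This is only a minimal sketch, assuming the dt0 from the question; the name filled is made up for this illustration.

```r
library(data.table)
library(zoo)  # provides na.spline

dt0 <- data.table(
    "age,sex,geo\\time" = c("TOTAL,F,AD", "TOTAL,F,AL", "TOTAL,F,AM", "TOTAL,F,AT", "TOTAL,F,AZ"),
    "2014" = c(NA, NA, NA, 4351253, NA),
    "2013" = c(37408, NA, NA, 4328238, 4707690),
    "2012" = c(38252, NA, 1684000, 4309977, 4651601),
    "2011" = c(38252, 1409931, 1679066, 4296293, 4594023),
    "2010" = c(40296, NA, 1673656, 4285442, 4542083)
)

# After transposing, years are rows and countries are columns,
# so na.spline fits one spline per country.
mat <- t(as.matrix(dt0[, -1]))
mat[] <- na.spline(mat, na.rm = FALSE)

# Transpose back and reattach the label column
# ("filled" is a hypothetical name, not from the original post).
filled <- cbind(dt0[, 1, with = FALSE], as.data.table(t(mat)))
print(filled)
```

The two transposes copy the data, which should not matter for a table of this size.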

Using a data.table. If you instead want to use a data.table, do

mat <- t(as.matrix(dt0[, -1]))
colnames(mat) <- dt0[[1]]
DT <- data.table(mat, keep.rownames=TRUE)
vn <- names(DT)[-1]  # value columns, i.e. everything except "rn"
DT[, (vn) := lapply(.SD, na.spline, na.rm=FALSE), .SDcols=vn]

which updates DT by reference, giving

     rn TOTAL,F,AD TOTAL,F,AL TOTAL,F,AM TOTAL,F,AT TOTAL,F,AZ
1: 2014      32832    1409931    1692440    4351253    4755163
2: 2013      37408    1409931    1688458    4328238    4707690
3: 2012      38252    1409931    1684000    4309977    4651601
4: 2011      38252    1409931    1679066    4296293    4594023
5: 2010      40296    1409931    1673656    4285442    4542083

Benchmarking:

mat <- t(as.matrix(dt0[, -1]))
colnames(mat) <- dt0[[1]]
DT <- data.table(mat, keep.rownames=TRUE)
vn <- names(DT)[-1]
tvn <- names(dt0)[-1]

require(microbenchmark)
microbenchmark(
    transp = dt0[, as.list(na.spline(unlist(.SD), na.rm=FALSE)), by=1:nrow(dt0), .SDcols=tvn],
    lapply = DT[, lapply(.SD, na.spline, na.rm=FALSE), .SDcols=vn],
    apply  = apply(mat, 2, na.spline, na.rm=FALSE),
    fun    = na.spline(mat, na.rm=FALSE),
    times=10)

Results:

Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval
 transp 4.666934 4.734891 4.850268 4.787690 4.897202 5.259957    10
 lapply 3.923823 4.010356 4.327646 4.039445 4.049957 6.976446    10
  apply 2.505556 2.525601 2.578890 2.585978 2.592090 2.758801    10
    fun 1.945290 1.994178 2.063063 2.068490 2.085112 2.272846    10

The "transp" result shows the timing of @shadow's solution, which preserves the OP's formatting. apply is not necessary here because na.spline already processes a matrix column by column.
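As for why approach #3 in the question gives different numbers: lapply(.SD, ...) calls na.spline once per column, so in the OP's layout the spline runs across countries within a single year rather than across years within a single country. In the 2014 column only AT has a value, which is why every row ends up with 4351253. The sketch below illustrates the difference on a small made-up table; toy, by_col and by_row are hypothetical names used only here.

```r
library(data.table)
library(zoo)  # provides na.spline

# Made-up data: two series (rows) observed at four time points (columns).
toy <- data.table(t1 = c(NA, 10), t2 = c(2, 20), t3 = c(3, 30), t4 = c(4, 40))

# Column-wise, as in approach #3: the NA in t1 is filled from the only
# other value in that column, which belongs to a different series.
by_col <- toy[, lapply(.SD, na.spline, na.rm = FALSE)]

# Row-wise, as in approach #2: the NA is filled from the same series.
by_row <- as.data.table(t(apply(toy, 1, na.spline, na.rm = FALSE)))

print(by_col)
print(by_row)
```

Putting the time dimension into the rows, as the answer does by transposing, makes the column-wise call fit the intended series.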

Published: 2023-07-10 08:27:09

Article link: https://www.elefans.com/category/jswz/34/1089837.html