Edit: The question was originally asked for data.table. A solution using any package would be interesting.
I am a little stuck on a particular variation of a more general problem. I have panel data that I am using with data.table, and I would like to fill in some missing values using data.table's group-by functionality. Unfortunately the values are not numeric, so I can't simply interpolate; they should only be filled in when a condition holds. Is it possible to perform a kind of conditional na.locf in data.table?
Essentially, I only want to fill in the NAs if the next observation after them is the same as the one before them, though the more general question is how to fill in NAs conditionally.
For example, in the following data I would like to fill in the associatedid variable within each id group. So id == 1, year == 2003 would be filled in as ABC123 because that is the value both before and after the NA, but year == 2000 for the same id would not be filled. id == 2 would not be changed, because the next value is not the same as the one before the NA. id == 3 would be filled in for 2003 and 2004.
    library(data.table)

    mydf <- structure(list(
      id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
             3L, 3L, 3L, 3L, 3L, 3L),
      year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L,
               2000L, 2001L, 2002L, 2003L, 2004L, 2005L,
               2000L, 2001L, 2002L, 2003L, 2004L, 2005L),
      associatedid = structure(c(NA, 1L, 1L, NA, 1L, 1L,
                                 NA, 1L, 1L, NA, 2L, 2L,
                                 NA, 1L, 1L, NA, NA, 1L),
                               .Label = c("ABC123", "DEF456"),
                               class = "factor")),
      class = "data.frame", row.names = c(NA, -18L))

    mydf
    #>    id year associatedid
    #> 1   1 2000         <NA>
    #> 2   1 2001       ABC123
    #> 3   1 2002       ABC123
    #> 4   1 2003         <NA>
    #> 5   1 2004       ABC123
    #> 6   1 2005       ABC123
    #> 7   2 2000         <NA>
    #> 8   2 2001       ABC123
    #> 9   2 2002       ABC123
    #> 10  2 2003         <NA>
    #> 11  2 2004       DEF456
    #> 12  2 2005       DEF456
    #> 13  3 2000         <NA>
    #> 14  3 2001       ABC123
    #> 15  3 2002       ABC123
    #> 16  3 2003         <NA>
    #> 17  3 2004         <NA>
    #> 18  3 2005       ABC123

    dt = data.table(mydf, key = c("id"))

Desired output:
    #>    id year associatedid
    #> 1   1 2000         <NA>
    #> 2   1 2001       ABC123
    #> 3   1 2002       ABC123
    #> 4   1 2003       ABC123
    #> 5   1 2004       ABC123
    #> 6   1 2005       ABC123
    #> 7   2 2000         <NA>
    #> 8   2 2001       ABC123
    #> 9   2 2002       ABC123
    #> 10  2 2003         <NA>
    #> 11  2 2004       DEF456
    #> 12  2 2005       DEF456
    #> 13  3 2000         <NA>
    #> 14  3 2001       ABC123
    #> 15  3 2002       ABC123
    #> 16  3 2003       ABC123
    #> 17  3 2004       ABC123
    #> 18  3 2005       ABC123

Recommended answer
This is all about writing a modified na.locf function. After that you can plug it into data.table like any other function.
    new.locf <- function(x){
      # might want to think about the end of this loop
      # this works here but you might need to add another case
      # if there are NA's as the last value.
      #
      # anyway, loop through observations in a vector, x.
      for(i in 2:(length(x)-1)){
        nextval = i
        # find the next, non-NA value
        # again, not tested but might break if there isn't one?
        while(nextval <= length(x)-1 & is.na(x[nextval])){
          nextval = nextval + 1
        }
        # if the current value is not NA, great!
        if(!is.na(x[i])){
          x[i] <- x[i]
        }else{
          # if the current value is NA, and the last value is a value
          # (should given the nature of this loop), and
          # the next value, as calculated above, is the same as the last
          # value, then give us that value.
          if(is.na(x[i]) & !is.na(x[i-1]) & x[i-1] == x[nextval]){
            x[i] <- x[nextval]
          }else{
            # finally, return NA if neither of these conditions hold
            x[i] <- NA
          }
        }
      }
      # return the new vector
      return(x)
    }
Once we have that function, we can use data.table as usual:
    dt2 <- dt[, list(year = year,
                     # when I read your data in, associatedid read as factor
                     associatedid = new.locf(as.character(associatedid))
                     ),
              by = "id"
              ]

This returns:
    > dt2
        id year associatedid
     1:  1 2000           NA
     2:  1 2001       ABC123
     3:  1 2002       ABC123
     4:  1 2003       ABC123
     5:  1 2004       ABC123
     6:  1 2005       ABC123
     7:  2 2000           NA
     8:  2 2001       ABC123
     9:  2 2002       ABC123
    10:  2 2003           NA
    11:  2 2004       DEF456
    12:  2 2005       DEF456
    13:  3 2000           NA
    14:  3 2001       ABC123
    15:  3 2002       ABC123
    16:  3 2003       ABC123
    17:  3 2004       ABC123
    18:  3 2005       ABC123
which is what you are looking for as best I understand it.
I hedged a little in the comments of the new.locf definition, so you may still have some thinking to do, but this should get you started.
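One case the comments in new.locf flag as untested is a group whose last value is NA. As far as I can tell, the comparison `x[i-1] == x[nextval]` then evaluates to NA and `if` stops with an error. Below is a minimal sketch of one way around that, using base R's `rle()` to locate each run of NAs and filling it only when a matching non-NA value sits on both sides. The name `fill_between_equal` and the small `df` are made up for illustration; this is not a function from data.table or zoo, and it is only one possible approach.

```r
# Hypothetical helper (not from any package): fill a run of NAs only when
# the non-NA values immediately before and after the run are equal.
fill_between_equal <- function(x) {
  x <- as.character(x)              # factors are compared as strings
  n <- length(x)
  if (n < 3L) return(x)             # no room for an interior gap

  runs   <- rle(is.na(x))           # alternating runs of NA / non-NA
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1L

  for (k in which(runs$values)) {   # visit NA runs only
    before <- starts[k] - 1L        # index just before the gap
    after  <- ends[k] + 1L          # index just after the gap
    # Leading and trailing gaps have no value on one side, so skip them.
    if (before >= 1L && after <= n && x[before] == x[after]) {
      x[starts[k]:ends[k]] <- x[before]
    }
  }
  x
}

# Same values as the example data in the question, built directly as strings.
df <- data.frame(
  id = rep(1:3, each = 6),
  year = rep(2000:2005, times = 3),
  associatedid = c(NA, "ABC123", "ABC123", NA, "ABC123", "ABC123",
                   NA, "ABC123", "ABC123", NA, "DEF456", "DEF456",
                   NA, "ABC123", "ABC123", NA, NA, "ABC123"),
  stringsAsFactors = FALSE
)

# Apply per id group with base R split/unsplit.
filled <- unsplit(lapply(split(df$associatedid, df$id), fill_between_equal),
                  df$id)
```

With data.table, the same helper should drop in the way new.locf does above, for example `dt[, associatedid := fill_between_equal(associatedid), by = id]`, assuming you want to update the column in place rather than build a new table.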