I have a well-balanced panel data set which contains NA observations. I will be using LOCF, and would like to know how many consecutive NA's are in each panel before carrying observations forward. LOCF is a procedure whereby missing values can be "filled in" using the "last observation carried forward". This can make sense in some time-series applications; perhaps we have weather data in 5 minute increments: a good guess at the value of a missing observation might be an observation made 5 minutes earlier.
Obviously, it makes more sense to carry an observation forward one hour within one panel than it does to carry that same observation forward to the next year in the same panel.
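To make the idea of carrying a value forward only across short gaps concrete, here is a minimal base R sketch. The helper name `locf_capped` and its `max_gap` argument are hypothetical and exist only for this illustration; they are not part of any package or of the original post.

```r
# Hypothetical helper (illustration only): LOCF that leaves any run of
# NA's longer than `max_gap` unfilled.
locf_capped <- function(x, max_gap = Inf) {
  # Position of the last non-NA value seen so far (0 if none yet)
  last_seen <- cumsum(!is.na(x))
  # Plain LOCF: look up the last seen value; leading NA's stay NA
  filled <- c(NA, x[!is.na(x)])[last_seen + 1L]

  # Length of the run (NA or non-NA) that each element belongs to
  runs <- rle(is.na(x))
  run_len <- rep(runs$lengths, runs$lengths)

  # Undo the fill inside NA runs that are longer than the cap
  filled[is.na(x) & run_len > max_gap] <- NA
  filled
}

x <- c(1, NA, NA, 3, NA, NA, NA, 5)
res <- locf_capped(x, max_gap = 2)
print(res)
```

With `max_gap = 2`, the two-element gap is filled with 1 and the three-element gap is left as NA.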
I am aware that you can set a "maxgap" argument using zoo::na.locf; however, I want to get a better feel for my data first. Please see a simple example:
    require(data.table)
    set.seed(12345)

    ### Create a "panel" data set
    data <- data.table(id = rep(1:10, each = 10),
                       date = seq(as.POSIXct('2012-01-01'),
                                  as.POSIXct('2012-01-10'),
                                  by = '1 day'),
                       x = runif(100))

    ### Randomly assign NA's to our "x" variable
    na <- sample(1:100, size = 52)
    data[na, x := NA]

    ### Calculate the max number of consecutive NA's by group...this is what I want:
    ### ID  Consecutive NA's
    #   1   1
    #   2   3
    #   3   3
    #   4   3
    #   5   4
    #   6   5
    #   ...
    #   10  2

    ### Count the total number of NA's by group...this is as far as I get:
    data[is.na(x), .N, by = id]

All solutions are welcomed, but data.table solutions are highly preferred; the data file is large.
Solution

This will do it:
    data[, max(with(rle(is.na(x)), lengths[values])), by = id]

I just ran rle to find all consecutive NA's and picked the max length.
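For readers unfamiliar with `rle`, the following small base R sketch shows what it returns for a short vector with two NA runs. The wrapper `max_na_run` is a hypothetical name used only here; it adds a `0L` floor so that a vector with no NA's yields 0 rather than `-Inf` with a warning, which is an assumption about the desired behavior and not something the answer specifies.

```r
x <- c(0.7, NA, NA, 0.3, NA, NA, NA, 0.9)

# rle() compresses the logical vector into runs of equal values
runs <- rle(is.na(x))
print(runs$lengths)  # length of each run
print(runs$values)   # TRUE marks a run of NA's

# Hypothetical wrapper (illustration only): longest NA run, 0 if none
max_na_run <- function(x) {
  runs <- rle(is.na(x))
  max(0L, runs$lengths[runs$values])
}

longest <- max_na_run(x)
print(longest)
```

Inside a data.table this could be used as `data[, max_na_run(x), by = id]`, which keeps the result column integer for every group.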
Here's a rather convoluted answer to the comment question of recovering the date ranges for the above max:
    data[, {
      tmp = rle(is.na(x));
      tmp$lengths[!tmp$values] = 0;  # modify rle result to ignore non-NA's
      n = which.max(tmp$lengths);    # find the index in rle of longest NA sequence

      tmp = rle(is.na(x));           # let's get back to the unmodified rle
      start = sum(tmp$lengths[0:(n-1)]) + 1;  # and find the start and end indices
      end = sum(tmp$lengths[1:n]);

      list(date[start], date[end], max(tmp$lengths[tmp$values]))
    }, by = id]
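As a possible alternative to the index arithmetic above, the date range of the longest NA run can also be sketched with `data.table::rleid`, which labels each run with an id. This is not the answer's method, only one way it might be done; the toy table `dt` and the column names `run`, `start`, `end` and `n_na` are made up for this illustration. Note that groups without any NA do not appear in the result.

```r
library(data.table)

# Toy panel (illustrative data, not the question's data set)
dt <- data.table(
  id   = rep(1:2, each = 6),
  date = rep(seq(as.Date("2012-01-01"), by = "1 day", length.out = 6), times = 2),
  x    = c(1, NA, NA, 4, NA, 6,
           NA, 2, NA, NA, NA, 6)
)

# Label consecutive runs of NA / non-NA within each id
dt[, run := rleid(is.na(x)), by = id]

# One row per NA run: first date, last date, and run length
na_runs <- dt[is.na(x),
              .(start = min(date), end = max(date), n_na = .N),
              by = .(id, run)]

# Keep the longest NA run per id (first one in case of ties)
longest_runs <- na_runs[, .SD[which.max(n_na)], by = id]
print(longest_runs)
```

Here the longest run for id 1 covers 2012-01-02 to 2012-01-03, and for id 2 it covers 2012-01-03 to 2012-01-05.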