我有一些数据,其中包括对特定状态适用于个人的时间范围的观察。一个人在给定的时间可以具有多种状态,或者根本没有任何统计信息(在这种情况下将没有观察值)。
ID状态开始结束 1 A 2013-03-07 2013-04-20 1 B 2013-04-10 2013-05-10 1 C 2013-04-16 2013-07-11 1 B 2013-07-25 2013-08-08 2 A 2013-09-10 2014-04-21 2 C 2013-12-27 2014- 01-26 2 D 2014-04-28 2014-05-10 2 E 2014-05-11 2014-07-16 $ p我想将其转换为一个数据帧,该数据帧包含不同的,非重叠的时间段以及在那些时间段内应用的所有状态。该示例的结果如下。 ID状态开始结束 1 A 2013-03-07 2013 -04-10 1 A,B 2013-04-10 2013-04-16 1 A,B,C 2013-04-16 2013-04-20 1 B,C 2013-04-20 2013-05-10 1 C 2013-05-10 2013-07-11 1 B 2013-07-25 2013-08-08 2 A 2013- 09-10 2013-12-27 2 A,C 2013-12-27 2014-01-26 2 D 2014-04-28 2014-05-10 2 E 2014- 2014年5月11日我的数据很大,因此我很难找到一个做到这一点的有效方法。我发现了半相关的问题,例如此处,但它们通常涉及将重叠的区域折叠成一个
我们将不胜感激。谢谢。
解决方案library(dplyr) #创建所有可能的日期范围 date_breaks<-sort(c(df $ START,df $ END)) #使用所有可能的组合构建一个data.frame df2< ;-expand.grid(START = head(date_breaks,-1), ID =唯一(df $ ID), STATUS =唯一(df $ STATUS), stringsAsFactors = F) df2 $ END<-tail(date_breaks,-1) #加入原始数据并过滤到开始和结束的范围 df2%&%;% inner_join(df,by = c( ID, STATUS))%> %% 过滤器(START.y< = START.x,END.y> = END.x )%>% group_by(ID,START = START.x,END = END.x)%&%;% summarise(STATUS = paste(unique(STATUS),崩溃=',') )%>% select(ID,STATUS,START,END) #小技巧:11 x 4 #组:ID,START [11] #ID状态开始结束#< i nt> < chr> < date> < date> #1 1 A 2013-03-07 2013-04-10 #2 1 A,B 2013-04-10 2013-04-16 #3 1 A,B,C 2013-04-16 2013-04-20 #4 1 B,C 2013-04-20 2013-05-10 #5 1 C 2013-05-10 2013-07-11 #6 1 B 2013-07-25 2013-08-08 #7 2 A 2013-09-10 2013-12-27 #8 2 A,C 2013-12-27 2014 -01-26 #9 2 A 2014-01-26 2014-04-21 #10 2 D 2014-04-28 2014-05-10 #11 2 E 2014- 2014年5月11日
数据
df<-read.table(text = ID STATUS START END 1 A 2013-03-07 2013-04-20 1 B 2013-04-10 2013-05-10 1 C 2013-04-16 2013-07-11 1 B 2013-07-25 2013-08-08 2 A 2013-09-10 2014-04-21 2 C 2013-12-27 2014-01-26 2 D 2014-04-28 2014-05-10 2 E 2014-05-11 2014-07-16,标头= T, colClasses = c('整数','字符','日期','日期'))
I have data that consists of observations of the range of time a particular status applied for an individual. An individual can have multiple statuses at a given time, or have no stats at all (in which case there would not be an observation).
ID STATUS START END 1 A 2013-03-07 2013-04-20 1 B 2013-04-10 2013-05-10 1 C 2013-04-16 2013-07-11 1 B 2013-07-25 2013-08-08 2 A 2013-09-10 2014-04-21 2 C 2013-12-27 2014-01-26 2 D 2014-04-28 2014-05-10 2 E 2014-05-11 2014-07-16I would like to convert this into a dataframe consisting of distinct, non-overlapping periods and all of the statuses that applied during those periods. The result from the example would be as below.
ID STATUS START END 1 A 2013-03-07 2013-04-10 1 A,B 2013-04-10 2013-04-16 1 A,B,C 2013-04-16 2013-04-20 1 B,C 2013-04-20 2013-05-10 1 C 2013-05-10 2013-07-11 1 B 2013-07-25 2013-08-08 2 A 2013-09-10 2013-12-27 2 A,C 2013-12-27 2014-01-26 2 D 2014-04-28 2014-05-10 2 E 2014-05-11 2014-07-16My data is fairly large, and so I struggling to find an efficient way to do this. I've found semi-related problems, such as here, but they typically involve collapsing overlapping regions into a super region, rather than breaking them out into distinct subregions.
Any help or ideas would be appreciated. Thanks.
解决方案 library(dplyr) # Create all the possible date ranges date_breaks <- sort(c(df$START, df$END)) # Build a data.frame with all possible combinations df2 <- expand.grid(START = head(date_breaks, -1), ID = unique(df$ID), STATUS = unique(df$STATUS), stringsAsFactors = F) df2$END <- tail(date_breaks, -1) # Join in original data and filter to where the start and end are in range df2 %>% inner_join(df, by = c("ID", "STATUS")) %>% filter(START.y <= START.x, END.y >= END.x) %>% group_by(ID, START = START.x, END = END.x) %>% summarise(STATUS = paste(unique(STATUS), collapse = ', ')) %>% select(ID, STATUS, START, END) # A tibble: 11 x 4 # Groups: ID, START [11] # ID STATUS START END # <int> <chr> <date> <date> # 1 1 A 2013-03-07 2013-04-10 # 2 1 A, B 2013-04-10 2013-04-16 # 3 1 A, B, C 2013-04-16 2013-04-20 # 4 1 B, C 2013-04-20 2013-05-10 # 5 1 C 2013-05-10 2013-07-11 # 6 1 B 2013-07-25 2013-08-08 # 7 2 A 2013-09-10 2013-12-27 # 8 2 A, C 2013-12-27 2014-01-26 # 9 2 A 2014-01-26 2014-04-21 # 10 2 D 2014-04-28 2014-05-10 # 11 2 E 2014-05-11 2014-07-16Data
df <- read.table(text = "ID STATUS START END 1 A 2013-03-07 2013-04-20 1 B 2013-04-10 2013-05-10 1 C 2013-04-16 2013-07-11 1 B 2013-07-25 2013-08-08 2 A 2013-09-10 2014-04-21 2 C 2013-12-27 2014-01-26 2 D 2014-04-28 2014-05-10 2 E 2014-05-11 2014-07-16", header = T, colClasses = c('integer', 'character', 'Date', 'Date'))
更多推荐
将重叠时间范围的数据转换为不同时间范围的数据
发布评论