I have a data.table that looks like this:

    ID, Order, Segment
    1, 1, A
    1, 2, B
    1, 3, B
    1, 4, C
    1, 5, B
    1, 6, B
    1, 7, B
    1, 8, B

The rows are ordered by the Order column. For each ID, I would like to count the length of every run of consecutive B's. Ideally the output I would like is:

    ID, Consec
    1, 2
    1, 4

Because the segment B appears consecutively in rows 2 and 3 (2 times), and then again in rows 5, 6, 7 and 8 (4 times).
The loop solution is quite obvious but would also be very slow.
Is there an elegant solution in data.table that is also fast?
P.S. The data I am dealing with has ~20 million rows.
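For anyone who wants to reproduce this, here is a minimal sketch that builds the sample table and checks the expected run lengths with base R's rle. It assumes a single ID that is already sorted by Order, so it only confirms the target output (2 and 4) and is not a solution for the full 20-million-row case. The names runs and b_runs are just illustrative.

```r
library(data.table)

# Sample table from the question (single ID, already sorted by Order)
DT <- data.table(
  ID      = rep(1L, 8),
  Order   = 1:8,
  Segment = c("A", "B", "B", "C", "B", "B", "B", "B")
)

# rle() collapses the vector into runs: values A, B, C, B with their lengths
runs <- rle(DT$Segment)

# Keep only the lengths of the runs whose value is "B"
b_runs <- runs$lengths[runs$values == "B"]
print(b_runs)
```

The printed lengths should match the Consec column shown above.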
Solution

Try
    library(data.table) # v1.9.5+
    DT[order(ID, Order)][, indx := rleid(Segment)][Segment == 'B', list(Consec = .N), by = list(indx, ID)][, indx := NULL][]
    #    ID Consec
    # 1:  1      2
    # 2:  1      4

Or as @eddi suggested

    DT[order(ID, Order)][, .(Consec = .N), by = .(ID, Segment, rleid(Segment))][Segment == 'B', .(ID, Consec)]
    #    ID Consec
    # 1:  1      2
    # 2:  1      4

A more memory-efficient method would be to use setorder instead of order (as suggested by @Arun), because setorder sorts DT by reference instead of creating a sorted copy:

    setorder(DT, ID, Order)[, .(Consec = .N), by = .(ID, Segment, rleid(Segment))][Segment == 'B', .(ID, Consec)]
    #    ID Consec
    # 1:  1      2
    # 2:  1      4
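To see why the grouping includes ID as well as the run id, here is a small sketch on a hypothetical two-ID table (not the data from the question) where a run of B's crosses the boundary between IDs. The column name run and the variable res are only for illustration. rleid(Segment) alone gives the four B rows the same run id, and it is the extra grouping by ID that splits them into one run per ID.

```r
library(data.table)  # rleid() needs v1.9.5 (dev) / v1.9.6 or later

# Hypothetical sample: ID 1 ends with B, B and ID 2 starts with B, B
DT <- data.table(
  ID      = c(1, 1, 1, 2, 2, 2),
  Order   = c(1, 2, 3, 1, 2, 3),
  Segment = c("A", "B", "B", "B", "B", "C")
)

# Sort by reference, as in the setorder() variant
setorder(DT, ID, Order)

# Run id over Segment only: the B rows of both IDs share one id
DT[, run := rleid(Segment)]
print(DT)

# Grouping by ID as well as run separates the runs per ID
res <- DT[Segment == "B", .(Consec = .N), by = .(ID, run)][, .(ID, Consec)]
print(res)
```

Note that setorder and := both modify DT in place, so take a copy(DT) first if the original row order or columns need to be preserved.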