我有大量客户的不同产品的开始和结束日期的数据。不同产品的间隔可能与购买之间有重叠或有时间差距:
库(lubridate)库(Hmisc) 库(dplyr) user_id< - c(rep(12,8),rep(33,5)) start_date< - dmy (31/10/2010,18/12/2010,31/10/2011,18/12/2011,27/03/2014,18/12/2014,27/03/2015,18/12/2016, 01/07/1992,20/08/1993,28/10/1999,31/01/2006,26/08/2016)) end_date< - dmy(Cs(31/10 / 2011,18/12/2011,28/04/2014,18/12/2014,27/03/2015,18/12/2016,27/03/2016,18/12/2017, 01 / 07/2016,16/08/2016,15/11/2012,28/02/2006,26/01/2017)) data< - data.frame(user_id,start_date, end_date) data user_id start_date end_date 1 12 2010-10-31 2011-10-31 2 12 2010-12-18 2011-12-18 3 12 2011-10-31 2014-04-28 4 12 2011-12-18 2014-12-18 5 12 2014-03-27 2015-03-27 6 12 2014-12-18 2016-12-18 7 12 2015-03- 27 2016-03-27 8 12 2016-12-18 2017-12-18 9 33 1992-07-01 2016-07-01 10 33 1993-08-20 2016 -08-16 11 33 1999-10-28 2012-11-15 12 33 2006-01-31 2006-02-28 13 33 2016-08-26 2017-01 -26我想计算活动天数或月份的总数他/她持有任何产品。
如果产品总是重叠就不会有问题,那么我可以简单地采取
data%>% group_by(user_id)%>% dplyr :: summarize(time_diff = max(end_date ) - min(start_date))但是,如您在用户33中可以看到的,产品不总是重叠,它们的间隔必须分别添加到所有重叠间隔。
有一种快速优雅的方式来编码,希望在 dplyr ?
解决方案我们可以使用 dplyr 中的函数计算总天数。以下示例展开每个时间段,然后删除重复的日期。最后计算每个 user_id 的总行号。
data2 < - data%>% rowwise()%>% do(data_frame(user_id =。$ user_id, Date = seq(。$ start_date,$ end_date,by = 1 )))%>% distinct()%>% ungroup()%>% count(user_id) pre>I have data listing start and end dates for different products for a big number of customers. The intervals for different products can overlap or have time gaps between purchases:
library(lubridate) library(Hmisc) library(dplyr) user_id <- c(rep(12, 8), rep(33, 5)) start_date <- dmy(Cs(31/10/2010, 18/12/2010, 31/10/2011, 18/12/2011, 27/03/2014, 18/12/2014, 27/03/2015, 18/12/2016, 01/07/1992, 20/08/1993, 28/10/1999, 31/01/2006, 26/08/2016)) end_date <- dmy(Cs(31/10/2011, 18/12/2011, 28/04/2014, 18/12/2014, 27/03/2015, 18/12/2016, 27/03/2016, 18/12/2017, 01/07/2016, 16/08/2016, 15/11/2012, 28/02/2006, 26/01/2017)) data <- data.frame(user_id, start_date, end_date) data user_id start_date end_date 1 12 2010-10-31 2011-10-31 2 12 2010-12-18 2011-12-18 3 12 2011-10-31 2014-04-28 4 12 2011-12-18 2014-12-18 5 12 2014-03-27 2015-03-27 6 12 2014-12-18 2016-12-18 7 12 2015-03-27 2016-03-27 8 12 2016-12-18 2017-12-18 9 33 1992-07-01 2016-07-01 10 33 1993-08-20 2016-08-16 11 33 1999-10-28 2012-11-15 12 33 2006-01-31 2006-02-28 13 33 2016-08-26 2017-01-26I'd like to calculate the total number of active days or months during which he/she held any the products.
It wouldn't be a problem if the products ALWAYS overlapped as then I could simply take
data %>% group_by(user_id) %>% dplyr::summarize(time_diff = max(end_date) - min(start_date))However, as you can see in user 33, products don't always overlap and their interval has to be added separately to all 'overlapped' intervals.
Is there a quick and elegant way to code it, hopefully in dplyr?
解决方案We can use functions from dplyr to count the total number of days. The following example expands each time period, and then removes duplicated dates. Finally count the total row number for each user_id.
data2 <- data %>% rowwise() %>% do(data_frame(user_id = .$user_id, Date = seq(.$start_date, .$end_date, by = 1))) %>% distinct() %>% ungroup() %>% count(user_id)
更多推荐
从重叠日期计算活动日/月
发布评论