This question already has an answer here:
Finding ALL duplicate rows, including “elements with smaller subscripts” (5 answers)

I have a data set which contains a number of unique identifiers for each date, e.g.

    df <- data.frame(date = as.Date(c("2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02")),
                     ids = c(3, 4, 1, 3))

I'd then like to summarise this information to get the number of new unique ids that appear on each date, i.e. ids that have not appeared on any earlier date. For example, on January 1 there are two unique ids (3 and 4). But on January 2 there is only one new unique id (1), because id 3 was already seen on January 1. So the resulting data frame should look like:

    date       n_new_unique_ids
    2016-01-01                2
    2016-01-02                1

Is this possible with dplyr? I had a look at lag, but a fixed lag size doesn't make sense in this context. Or perhaps with another package?
Best answer
One option would be to remove all the duplicated 'ids' from the dataset. Note that this drops every id that occurs more than once, including its first occurrence. (The output shown appears to come from the data as first posted, before the question was updated.)

    library(dplyr)

    df %>%
      filter(!(duplicated(ids) | duplicated(ids, fromLast = TRUE)))
    #        date ids
    #1 2016-01-01   2
    #2 2016-01-02   3

Update

Using the updated data, sort by date, keep only the first occurrence of each id, and count the remaining rows per date:

    df %>%
      arrange(date, ids) %>%
      filter(!duplicated(ids)) %>%
      group_by(date) %>%
      summarise(n_unique_ids = n())
    #        date n_unique_ids
    #      <date>        <int>
    #1 2016-01-01            2
    #2 2016-01-02            1
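
To make the difference between the two filters in the answer concrete, here is a minimal base R sketch on the question's example data. It is an illustration, not part of the original answer: the variable names `later_occurrence`, `any_occurrence`, `new_ids` and `result` are made up for this example, and `aggregate()` is swapped in for the dplyr `group_by()`/`summarise()` step.

```r
# Same example data as in the question.
df <- data.frame(
  date = as.Date(c("2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02")),
  ids  = c(3, 4, 1, 3)
)

# duplicated() flags only the second and later occurrences of a value.
later_occurrence <- duplicated(df$ids)
# FALSE FALSE FALSE  TRUE

# OR-ing with fromLast = TRUE also flags the first occurrence, so an id
# that appears more than once is flagged on every row where it appears.
any_occurrence <- duplicated(df$ids) | duplicated(df$ids, fromLast = TRUE)
#  TRUE FALSE FALSE  TRUE

# Sort by date, keep the first sighting of each id, then count rows per date.
df <- df[order(df$date), ]
new_ids <- df[!duplicated(df$ids), ]
result <- aggregate(ids ~ date, data = new_ids, FUN = length)
names(result)[2] <- "n_new_unique_ids"
print(result)
```

One caveat that applies to this sketch and to the dplyr pipeline alike: a date on which no new id appears has no rows left after filtering, so it is absent from the result rather than listed with a count of 0.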