cumsum in grouped data with dplyr

I have a data frame df (which can be downloaded here), taken from a register of companies, that looks something like this:

        Provider.ID  Local.Authority         month  year  entry  exit  total
    1   1-102642676  Warwickshire               10  2010      2     0      2
    2   1-102642676  Bury                       10  2010      1     0      1
    3   1-102642676  Kent                       10  2010      1     0      1
    4   1-102642676  Essex                      10  2010      1     0      1
    5   1-102642676  Lambeth                    10  2010      2     0      2
    6   1-102642676  East Sussex                10  2010      5     0      5
    7   1-102642676  Bristol, City of           10  2010      1     0      1
    8   1-102642676  Liverpool                  10  2010      1     0      1
    9   1-102642676  Merton                     10  2010      1     0      1
    10  1-102642676  Cheshire East              10  2010      2     0      2
    11  1-102642676  Knowsley                   10  2010      1     0      1
    12  1-102642676  North Yorkshire            10  2010      1     0      1
    13  1-102642676  Kingston upon Thames       10  2010      1     0      1
    14  1-102642676  Lewisham                   10  2010      1     0      1
    15  1-102642676  Wiltshire                  10  2010      1     0      1
    16  1-102642676  Hampshire                  10  2010      1     0      1
    17  1-102642676  Wandsworth                 10  2010      1     0      1
    18  1-102642676  Brent                      10  2010      1     0      1
    19  1-102642676  West Sussex                10  2010      1     0      1
    20  1-102642676  Windsor and Maidenhead     10  2010      1     0      1
    21  1-102642676  Luton                      10  2010      1     0      1
    22  1-102642676  Enfield                    10  2010      1     0      1
    23  1-102642676  Somerset                   10  2010      1     0      1
    24  1-102642676  Cambridgeshire             10  2010      1     0      1
    25  1-102642676  Hillingdon                 10  2010      1     0      1
    26  1-102642676  Havering                   10  2010      1     0      1
    27  1-102642676  Solihull                   10  2010      1     0      1
    28  1-102642676  Bexley                     10  2010      1     0      1
    29  1-102642676  Sandwell                   10  2010      1     0      1
    30  1-102642676  Southampton                10  2010      1     0      1
    31  1-102642676  Trafford                   10  2010      1     0      1
    32  1-102642676  Newham                     10  2010      1     0      1
    33  1-102642676  West Berkshire             10  2010      1     0      1
    34  1-102642676  Reading                    10  2010      1     0      1
    35  1-102642676  Hartlepool                 10  2010      1     0      1
    36  1-102642676  Hampshire                   3  2011      1     0      1
    37  1-102642676  Kent                        9  2011      0     1     -1
    38  1-102642676  North Yorkshire            12  2011      0     1     -1
    39  1-102642676  North Somerset             12  2012      2     0      2
    40  1-102642676  Kent                       10  2014      1     0      1
    41  1-102642676  Somerset                    1  2016      0     1     -1

My goal is to create a variable that holds the cumulative sum of the last variable (total) for each Local.Authority and each year. total is just the difference between entry and exit. I tried to do this with dplyr as follows:

    library(dplyr)

    df.1 = df %>%
      group_by(Local.Authority, year) %>%
      mutate(cum.total = cumsum(total)) %>%
      arrange(year, month, Local.Authority)

This yields the (wrong) result below:

    > df.1
    Source: local data frame [41 x 8]
    Groups: Local.Authority, year [41]

       Provider.ID  Local.Authority month  year entry  exit total cum.total
            <fctr>           <fctr> <int> <int> <int> <int> <int>     <int>
    1  1-102642676           Bexley    10  2010     1     0     1        35
    2  1-102642676            Brent    10  2010     1     0     1        25
    3  1-102642676 Bristol, City of    10  2010     1     0     1        13
    4  1-102642676             Bury    10  2010     1     0     1         3
    5  1-102642676   Cambridgeshire    10  2010     1     0     1        31
    6  1-102642676    Cheshire East    10  2010     2     0     2        17
    7  1-102642676      East Sussex    10  2010     5     0     5        12
    8  1-102642676          Enfield    10  2010     1     0     1        29
    9  1-102642676            Essex    10  2010     1     0     1         5
    10 1-102642676        Hampshire    10  2010     1     0     1        23
    ..         ...              ...   ...   ...   ...   ...   ...       ...

I confirmed that the results are wrong by checking levels of Local.Authority that appear in several years (for example, Kent):

    > check = df.1 %>% filter(Local.Authority == "Kent")
    > check
    Source: local data frame [3 x 8]
    Groups: Local.Authority, year [3]

      Provider.ID Local.Authority month  year entry  exit total cum.total
           <fctr>          <fctr> <int> <int> <int> <int> <int>     <int>
    1 1-102642676            Kent    10  2010     1     0     1         4
    2 1-102642676            Kent     9  2011     0     1    -1        42
    3 1-102642676            Kent    10  2014     1     0     1        44

It should instead be:

      Provider.ID Local.Authority month  year entry  exit total cum.total
           <fctr>          <fctr> <int> <int> <int> <int> <int>     <int>
    1 1-102642676            Kent    10  2010     1     0     1         1
    2 1-102642676            Kent     9  2011     0     1    -1         0
    3 1-102642676            Kent    10  2014     1     0     1         1

Does anyone know why cumsum is producing these results? Many thanks in advance.
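
One check that may help narrow this down: the reported cum.total values for Kent (4, 42, 44) coincide with a running total over the whole data frame in its original row order, as if the grouping had been ignored. The sketch below rebuilds the 41 rows shown above (Provider.ID is left out because it is constant) and compares an ungrouped cumsum with a per-group one computed in base R via ave(). Why the grouping would be ignored is not established here; a mutate() from another attached package masking dplyr's version is only one possible cause, and calling dplyr::mutate() explicitly would be one way to rule it out.

```r
# Rebuild the 41 rows from the question in their original order.
# Provider.ID is omitted because it is identical on every row.
df <- data.frame(
  Local.Authority = c(
    "Warwickshire", "Bury", "Kent", "Essex", "Lambeth", "East Sussex",
    "Bristol, City of", "Liverpool", "Merton", "Cheshire East", "Knowsley",
    "North Yorkshire", "Kingston upon Thames", "Lewisham", "Wiltshire",
    "Hampshire", "Wandsworth", "Brent", "West Sussex",
    "Windsor and Maidenhead", "Luton", "Enfield", "Somerset",
    "Cambridgeshire", "Hillingdon", "Havering", "Solihull", "Bexley",
    "Sandwell", "Southampton", "Trafford", "Newham", "West Berkshire",
    "Reading", "Hartlepool", "Hampshire", "Kent", "North Yorkshire",
    "North Somerset", "Kent", "Somerset"
  ),
  month = c(rep(10, 35), 3, 9, 12, 12, 10, 1),
  year  = c(rep(2010, 35), 2011, 2011, 2011, 2012, 2014, 2016),
  total = c(2, 1, 1, 1, 2, 5, 1, 1, 1, 2, rep(1, 25), 1, -1, -1, 2, 1, -1),
  stringsAsFactors = FALSE
)

is_kent <- df$Local.Authority == "Kent"

# Running total over the whole data frame, with no grouping at all.
ungrouped <- cumsum(df$total)
print(ungrouped[is_kent])          # 4 42 44, the values reported for Kent

# Base-R running total within each (Local.Authority, year) combination.
by_authority_year <- ave(df$total, df$Local.Authority, df$year, FUN = cumsum)
print(by_authority_year[is_kent])  # 1 -1 1
```

Neither result is the desired 1, 0, 1: the ungrouped total mixes all authorities together, while grouping by year as well restarts the total each year.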

Best answer

When you group by Local.Authority and year, every distinct authority/year combination becomes its own group, so cumsum restarts each year and for Kent returns 1, -1, 1. It is better to group by Local.Authority only, so that cumsum accumulates total across years and gives 1, 0, 1:

    # df here holds only the three Kent rows, as the output below shows
    df <- df %>%
      group_by(Local.Authority) %>%
      mutate(cum.to = cumsum(total))

    > df
    Source: local data frame [3 x 8]
    Groups: Local.Authority [1]

      Provider.ID Local.Authority month  year entry  exit total cum.to
            <chr>           <chr> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
    1 1-102642676            Kent    10  2010     1     0     1      1
    2 1-102642676            Kent     9  2011     0     1    -1      0
    3 1-102642676            Kent    10  2014     1     0     1      1
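
Since cumsum accumulates in whatever order the rows happen to be in, it may be worth sorting chronologically before computing the running total rather than afterwards. The following is a minimal sketch of that idea, assuming a hypothetical, deliberately shuffled subset of the question's rows (df_small and cum.total are illustrative names, not part of the original answer):

```r
library(dplyr)

# Hypothetical subset of the question's data, shuffled on purpose so that
# row order differs from chronological order.
df_small <- data.frame(
  Local.Authority = c("Kent", "Somerset", "Kent", "Kent",
                      "Somerset", "Hampshire", "Hampshire"),
  month = c(10, 1, 10, 9, 10, 3, 10),
  year  = c(2014, 2016, 2010, 2011, 2010, 2011, 2010),
  total = c(1, -1, 1, -1, 1, 1, 1),
  stringsAsFactors = FALSE
)

result <- df_small %>%
  arrange(year, month) %>%               # put rows in chronological order first
  group_by(Local.Authority) %>%          # one running total per authority
  mutate(cum.total = cumsum(total)) %>%  # accumulates across years within a group
  ungroup()

kent <- result %>% filter(Local.Authority == "Kent")
print(kent$cum.total)                    # 1 0 1
```

Calling ungroup() at the end is a design choice: it keeps later steps from silently operating per group.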
