cumsum in grouped data with dplyr

Updated: 2024-10-28 00:14:10

I have a data frame df (which can be downloaded here), taken from a register of companies, that looks something like this:

       Provider.ID Local.Authority        month year entry exit total
     1 1-102642676 Warwickshire              10 2010     2    0     2
     2 1-102642676 Bury                      10 2010     1    0     1
     3 1-102642676 Kent                      10 2010     1    0     1
     4 1-102642676 Essex                     10 2010     1    0     1
     5 1-102642676 Lambeth                   10 2010     2    0     2
     6 1-102642676 East Sussex               10 2010     5    0     5
     7 1-102642676 Bristol, City of          10 2010     1    0     1
     8 1-102642676 Liverpool                 10 2010     1    0     1
     9 1-102642676 Merton                    10 2010     1    0     1
    10 1-102642676 Cheshire East             10 2010     2    0     2
    11 1-102642676 Knowsley                  10 2010     1    0     1
    12 1-102642676 North Yorkshire           10 2010     1    0     1
    13 1-102642676 Kingston upon Thames      10 2010     1    0     1
    14 1-102642676 Lewisham                  10 2010     1    0     1
    15 1-102642676 Wiltshire                 10 2010     1    0     1
    16 1-102642676 Hampshire                 10 2010     1    0     1
    17 1-102642676 Wandsworth                10 2010     1    0     1
    18 1-102642676 Brent                     10 2010     1    0     1
    19 1-102642676 West Sussex               10 2010     1    0     1
    20 1-102642676 Windsor and Maidenhead    10 2010     1    0     1
    21 1-102642676 Luton                     10 2010     1    0     1
    22 1-102642676 Enfield                   10 2010     1    0     1
    23 1-102642676 Somerset                  10 2010     1    0     1
    24 1-102642676 Cambridgeshire            10 2010     1    0     1
    25 1-102642676 Hillingdon                10 2010     1    0     1
    26 1-102642676 Havering                  10 2010     1    0     1
    27 1-102642676 Solihull                  10 2010     1    0     1
    28 1-102642676 Bexley                    10 2010     1    0     1
    29 1-102642676 Sandwell                  10 2010     1    0     1
    30 1-102642676 Southampton               10 2010     1    0     1
    31 1-102642676 Trafford                  10 2010     1    0     1
    32 1-102642676 Newham                    10 2010     1    0     1
    33 1-102642676 West Berkshire            10 2010     1    0     1
    34 1-102642676 Reading                   10 2010     1    0     1
    35 1-102642676 Hartlepool                10 2010     1    0     1
    36 1-102642676 Hampshire                  3 2011     1    0     1
    37 1-102642676 Kent                       9 2011     0    1    -1
    38 1-102642676 North Yorkshire           12 2011     0    1    -1
    39 1-102642676 North Somerset            12 2012     2    0     2
    40 1-102642676 Kent                      10 2014     1    0     1
    41 1-102642676 Somerset                   1 2016     0    1    -1

My goal is to create a variable holding the cumulative sum of the last column (total) for each Local.Authority and each year. total is simply entry minus exit. I tried to do this with dplyr as follows:

    library(dplyr)

    df.1 = df %>%
      group_by(Local.Authority, year) %>%
      mutate(cum.total = cumsum(total)) %>%
      arrange(year, month, Local.Authority)

This yields the (wrong) result below:

    > df.1
    Source: local data frame [41 x 8]
    Groups: Local.Authority, year [41]

       Provider.ID Local.Authority  month  year entry  exit total cum.total
       <fctr>      <fctr>           <int> <int> <int> <int> <int>     <int>
     1 1-102642676 Bexley              10  2010     1     0     1        35
     2 1-102642676 Brent               10  2010     1     0     1        25
     3 1-102642676 Bristol, City of    10  2010     1     0     1        13
     4 1-102642676 Bury                10  2010     1     0     1         3
     5 1-102642676 Cambridgeshire      10  2010     1     0     1        31
     6 1-102642676 Cheshire East       10  2010     2     0     2        17
     7 1-102642676 East Sussex         10  2010     5     0     5        12
     8 1-102642676 Enfield             10  2010     1     0     1        29
     9 1-102642676 Essex               10  2010     1     0     1         5
    10 1-102642676 Hampshire           10  2010     1     0     1        23
    .. ...         ...                ...   ...   ...   ...   ...       ...

I confirmed the problem by checking a level of Local.Authority that appears in several years (for example Kent):

    > check = df.1 %>% filter(Local.Authority == "Kent")
    > check
    Source: local data frame [3 x 8]
    Groups: Local.Authority, year [3]

      Provider.ID Local.Authority month  year entry  exit total cum.total
      <fctr>      <fctr>          <int> <int> <int> <int> <int>     <int>
    1 1-102642676 Kent               10  2010     1     0     1         4
    2 1-102642676 Kent                9  2011     0     1    -1        42
    3 1-102642676 Kent               10  2014     1     0     1        44

Whereas it should be:

      Provider.ID Local.Authority month  year entry  exit total cum.total
      <fctr>      <fctr>          <int> <int> <int> <int> <int>     <int>
    1 1-102642676 Kent               10  2010     1     0     1         1
    2 1-102642676 Kent                9  2011     0     1    -1         0
    3 1-102642676 Kent               10  2014     1     0     1         1

Does anyone know why cumsum produces these results? Many thanks in advance.
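As a quick check of where the reported numbers could come from, the sketch below rebuilds only the total column from the 41 rows listed in the question and takes a plain running sum with no grouping at all. It uses base R only. The vector name total and the Kent row positions (3, 37, 40) are read off the table; nothing else is assumed.

```r
# total column of the 41 rows, in the original row order from the question
total <- c(2, 1, 1, 1, 2, 5, 1, 1, 1, 2,  # rows 1-10
           rep(1, 25),                     # rows 11-35
           1, -1, -1, 2, 1, -1)            # rows 36-41

# Running sum over the whole column, ignoring Local.Authority and year
running <- cumsum(total)

# Rows 3, 37 and 40 are the three Kent rows
kent_rows <- c(3, 37, 40)
print(running[kent_rows])
# 4, 42, 44: the same values reported for Kent in cum.total
```

The match suggests that the grouping was not in effect when cumsum ran in that session, so every row received the running total of the entire data frame. The post does not say why that happened, so treat this as an observation about the numbers rather than a diagnosis.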

Best answer

When you group by Local.Authority and year, each Kent group contains a single row, so cumsum simply returns the total values 1, -1, 1. It is better to group by Local.Authority only, so that cumsum accumulates total across the years and gives 1, 0, 1:

    df <- df %>%
      group_by(Local.Authority) %>%
      mutate(cum.to = cumsum(total))

    > df
    Source: local data frame [3 x 8]
    Groups: Local.Authority [1]

      Provider.ID Local.Authority month  year entry  exit total cum.to
      <chr>       <chr>           <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
    1 1-102642676 Kent               10  2010     1     0     1      1
    2 1-102642676 Kent                9  2011     0     1    -1      0
    3 1-102642676 Kent               10  2014     1     0     1      1
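The difference between the two groupings can be reproduced on a small sample. The following is a minimal sketch, assuming dplyr is installed; the five-row data frame is an illustrative subset of the rows in the question (not the full data set), and the object names by_both and by_authority are made up for this example. Because cumsum depends on row order, the sketch sorts by year and month before grouping. The explicit dplyr:: prefix on mutate is only a precaution in case another attached package provides its own mutate.

```r
library(dplyr)

# Illustrative subset of the rows shown in the question (hypothetical sample)
df <- data.frame(
  Local.Authority = c("Kent", "Hampshire", "Hampshire", "Kent", "Kent"),
  month = c(10, 10, 3, 9, 10),
  year  = c(2010, 2010, 2011, 2011, 2014),
  entry = c(1, 1, 1, 0, 1),
  exit  = c(0, 0, 0, 1, 0),
  stringsAsFactors = FALSE
)
df$total <- df$entry - df$exit

# Grouping by authority AND year: every group here holds one row,
# so the running sum just repeats total
by_both <- df %>%
  group_by(Local.Authority, year) %>%
  dplyr::mutate(cum.total = cumsum(total)) %>%
  ungroup()

# Sort chronologically, then group by authority only,
# so the sum carries over from one year to the next
by_authority <- df %>%
  arrange(year, month) %>%
  group_by(Local.Authority) %>%
  dplyr::mutate(cum.total = cumsum(total)) %>%
  ungroup()

kent_both <- by_both$cum.total[by_both$Local.Authority == "Kent"]
kent_authority <- by_authority$cum.total[by_authority$Local.Authority == "Kent"]

print(kent_both)       # 1 -1 1
print(kent_authority)  # 1 0 1
```

If a running total that restarts every year is what is actually wanted, grouping by both columns is the right choice; the single-authority grouping only matches the expected Kent output because that output carries the balance across years.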

Published: 2023-07-25 22:25:00
Article link: https://www.elefans.com/category/jswz/34/1267057.html