按组进行数据帧聚合

编程入门 行业动态 更新时间:2024-10-26 06:28:55
本文介绍了按组进行数据帧聚合 - 按照范围分隔列的值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧! 问题描述 我有一个数据框如下:

父< - c('a','b','c' ,'d','e','f','g','h','i','j','k','l',' m','n','o','p','q','r','s','t','u','v' ,'x','y','z') child< - c('A','B','C','D','E' 'F','G','H','I','J','K','L','M','N','O' ','Q','R','S','T','U','V','W','X',' 'Z')键入< - c('desktop','desktop','desktop','desktop','desktop','desktop','desktop','desktop','desktop','desktop','desktop','desktop','desktop','deskt 桌面,桌面,桌面,桌面,桌面,桌面,桌面,桌面,桌面 ,'desktop','desktop') Size< - c('MEDIUM','MEDIUM','LARGE','LARGE','SMALL','MEDIUM' 大,小,小,大,大,小,小,大,大 $ b'MEDIUM','SMALL','SMALL','MEDIUM','LARGE','MEDIUM','SMALL','MEDIUM','LARGE','MEDIUM' 收入< - c(22138.16,18617.94,12394.36,10535.76, 8901.41,7320.17,3821.40,2811.50, 2483.10,2145.76,2138.41,2037.67, 1950.52,1837.93,1737.68 ,1554.61, 1374.40,1334.02,1214.60,1191.41, 1189.56,1174.55,1162.80,1131.29, 1127.05,1108.53) NumberofSales< - c(1954720,5129937,108104104,970326, 1608012,746613,333424,236643, 352294,587541,209218,342455, 192670 ,340580,275260,248049, 251790,128845,303515,112218, 149878,226633,194973,103425, 101819,114570) Price <-C(11.325489, 3.629273,11.411762,10.857959 5.535661 9.804504 11.461083 11.880766 7.048374 3.652103 10.220966,5.550183 10.123631,5.36471,6.312868,6267350 5.458517,103.53681, 4.001779,10616924, 7.936855,5.182608,5.963908,10.938264, 11.069152,9.675570) Opps< - c(5144351,6038044,23454341,4578272, 7197544,474510,1045528 ,181471, 1071631,801038,928563,477870, 590497,849537,410179,432703, 198 3993,330478,939806,191824, 283107,575004,256846,249530, 142318,2036363) df< -data.frame(parent,child,Type,Size,收入,销售数量,价格,运营商)

这是它的样子:

df 父子类型大小收入NumberofSales价格Opps 1 a A桌面MEDIUM 22138.16 1954720 11.325489 5144351 2 b B桌面MEDIUM 18617.94 5129937 3.629273 6038044 3 c C桌面大号12394.36 1086104 11.411762 2354341 4 d D桌面大号10535.76 970326 10.857959 4578272 5 e E桌面小8901.41 1608012 5.535661 7197544 6 f F桌面MEDIUM 7320.17 746613 9.804504 474510 7 g G桌面大3821.40 333424 11.461083 1045528 8小时H桌面小2811.50 236643 11.880766 181471 9我桌面媒体2483.10 352294 7.048374 1071631 10 j J桌面SMALL 2145.76 587541 3.652103 801038 11 k K桌面LARGE 2138.41 209218 10.220966 928563 12 l L桌面LARGE 2037.67 342455 5.950183 477870 13 m M桌面SMALL 1950.52 192670 10.123631 590497 14 n N桌面SMALL 1837.93 340580 5.396471 849537 15 o O桌面LARGE 1737.68 275260 6.312868 410179 16 p P桌面LARGE 1554.61 248049 6.267350 432703 17 q Q桌面媒体1374.40 251790 5.458517 1983993 18 r R桌面SMALL 1334.02 128845 10.353681 330478 19 s S桌面SMALL 1214.60 303515 4.001779 939806 20 t T桌面媒体1191.41 112218 10.616924 191824 21 u U desktop LARGE 1189.56 149878 7.936855 283107 22 v V桌面媒体1174.55 226633 5.182608 575004 23 w W桌面SMALL 1162.80 194973 5.963908 256846 24 x X桌面媒体1131.29 103425 10.938264 249530 25 y Y桌面LARGE 1127.05 101819 11.069152 142318 26 z Z桌面媒体1108.53 114570 9.675570 2036363

我想创建一个数据框显示的价格 BY 大小和类型包含这些价格范围的所有适当指标。我想要最终的数据框看起来像这样。 (我没有做这个度量值的聚合,因为我目前正在做的太长了,这就是为什么他们现在都是一样的,但最终的答案应该有所有不同的值)

类型尺寸价格范围SUM_Opps SUM_NumberofSales SUM_Revenue 1桌面LARGE $ 3 $ 3.99 9,143,587 2,531,983 $ 8,453.93 1桌面LARGE $ 4- $ 4.99 9,143,587 2,531,983 $ 8,453.93 1桌面LARGE $ 5- $ 5.99 9,143,587 2,531,983 $ 8,453.93 1桌面LARGE $ 6- $ 6.99 9,143,587 2,531,983 $ 8,453.93 1桌面LARGE $ 7- $ 7.99 9,143,587 2,531,983 $ 8,453.93 1桌面LARGE $ 8- $ 8.99 9,143,587 2,531,983 $ 8,453.93 1桌面LARGE $ 9 $ 9.99 9,143,587 2,531,983 $ 8,453.93 1桌面LARGE $ 10- $ 10.99 9,143,587 2,531,983 $ 8,453.93 1桌面LARGE $ 11- $ 11.99 9,143,587 2,531,983 $ 8,453.93 1桌面LARGE $ 12- $ 12.99 9,143,587 2,531,983 $ 8,453.93 1桌面LARGE $ 13-大9,143,587 2,531,983 $ 8,453.93 1桌面媒体$ 3- $ 3.99 9,143,587 2,531,983 $ 8,453.93 1桌面媒体$ 4- $ 4.99 9,143,587 2,531,983 $ 8,453.93 1桌面媒体$ 5- $ 5.99 9,143,587 2,531,983 $ 8,453.93 1桌面媒体$ 6- $ 6.99 9,143,587 2,531,983 $ 8,453.93 1桌面媒体$ 7 - $ 7.99 9,143,587 2,531,983 $ 8,453.93 1桌面媒体$ 8- $ 8.99 9,143,587 2,531,983 $ 8,453.93 1桌面媒体$ 9- $ 9.99 9,143,587 2,531,983 $ 8,453.93 1桌面媒体$ 10- $ 10.99 9,143,587 2,531,983 $ 8,453.93 1桌面中档$ 11- $ 11.99 9,143,587 2,531,983 $ 8,453.93 1桌面媒体$ 12- $ 12.99 9,143,587 2,531,983 $ 8,453.93 1桌面媒体$ 13 - 大9,143,587 2,531,983 $ 8,453.93 1桌面小$ 3 $ 3.99 9,143,587 2,531,983 $ 8,453.93 1桌面SMALL $ 4 - $ 4.99 9,143,587 2,531,983 $ 8,453.93 1桌面SMALL $ 5- $ 5.99 9,143,587 2,531,983 $ 8,453.93 1桌面SMALL $ 6- $ 6.99 9,143,587 2,531,983 $ 8,453.93 1桌面小$ 7 $ 7.99 9,143,587 2,531,983 $ 8,453.93 1桌面小$ 8- $ 8.99 9,143,587 2,531,983 $ 8,453.93 1桌面小$ 9 $ 9.99 9,143,587 2,531,983 $ 8,453.93 1桌面小$ 10 $ 10.99 9,143,587 2,531,983 $ 8,453.93 1桌面小$ 11- $ 11.99 9,143,587 2,531,983 $ 8,453.93 1桌面小$ 12- $ 12.99 9,143,587 2,531,983 $ 8,453.93 1桌面小$ 13-大9,143,587 2,531,983 $ 8,453.93

如何创建上表?上表显示了 OPPS ,销售数量和收入 BY 类型,大小和价格范围。

我了解了如何使用dplyr进行简单的聚合,但艰难的部分是分配价格。

任何帮助都会很棒,谢谢!

解决方案

使用 Hmisc :: cut2()生成一个因子水平的价格仓:

code> library(Hmisc) library(dplyr) df $ cut_Price< - cut2(df $ Price,cutting = 4:13) df%>%group_by(cut_Price,Size,Type)%>% summarise_at(c(Opps,NumberofSales,Revenue),sum)%>%布置(Size,cut_Price)%>%ungroup()%>% mutate(cut_Price = gsub((。*,\\\\。))00,\ \199,cut_Price)) #一个字符串:16×6 cut_Price大小类型Opps Number ofSales收入< chr> < FCTR> < FCTR> < DBL> < DBL> < DBL> 1 [5.00,6.99] LARGE桌面477870 342455 2037.67 2 [6.00,7.99] LARGE桌面842882 523309 3292.29 3 [7.00,8.99] LARGE桌面283107 149878 1189.56 4 [10.00,11.00] LARGE桌面5506835 1179544 12674.17 5 [11.00,12.00] LARGE桌面3542187 1521347 17342.81 6 [3.63,4.99]媒体桌面6038044 5129937 18617.94 7 [5.00,6.99]媒体桌面2558997 478423 2548.95 8 [7.00,8.99)媒体桌面1071631 352294 2483.10 9 [9.00,10.00]媒体桌面2510873 861183 8428.70 10 [10.00,11.00)媒体桌面441354 215643 2322.70 11 [11.00,12.00] MEDIUM桌面5144351 1954720 22138.16 12 [3.63,4.99)小桌面801038 587541 2145.76 13 [4.00,5.99]小桌面939806 303515 1214.60 14 [5.00,6.99]小桌面8303927 2143565 11902.14 15 [10.00,11.00)小桌面920975 321515 3284.54 16 [11.00,12.00]小桌面181471 236643 2811.50

如果你想调整削减每0.5而不是1,你可以这样做,因为它的向量传递给 cut = ... 正在定义切点:

df $ cut_Price< - cut2(df $ Price,cutting = seq(4,13,.5))

I have a dataframe as follows:

parent<- c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z') child<- c('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z') Type<- c('desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop', 'desktop') Size<- c('MEDIUM', 'MEDIUM', 'LARGE', 'LARGE', 'SMALL', 'MEDIUM', 'LARGE', 'SMALL', 'MEDIUM', 'SMALL', 'LARGE', 'LARGE', 'SMALL', 'SMALL', 'LARGE', 'LARGE', 'MEDIUM', 'SMALL', 'SMALL', 'MEDIUM', 'LARGE', 'MEDIUM', 'SMALL', 'MEDIUM', 'LARGE', 'MEDIUM') Revenue<- c(22138.16, 18617.94, 12394.36, 10535.76, 8901.41, 7320.17, 3821.40, 2811.50, 2483.10, 2145.76, 2138.41, 2037.67, 1950.52, 1837.93, 1737.68, 1554.61, 1374.40, 1334.02, 1214.60, 1191.41, 1189.56, 1174.55, 1162.80, 1131.29, 1127.05, 1108.53) NumberofSales<- c(1954720, 5129937, 1086104, 970326, 1608012, 746613, 333424, 236643, 352294, 587541, 209218, 342455, 192670, 340580, 275260, 248049, 251790, 128845, 303515, 112218, 149878, 226633, 194973, 103425, 101819, 114570) Price<- c(11.325489, 3.629273, 11.411762, 10.857959, 5.535661, 9.804504, 11.461083, 11.880766, 7.048374, 3.652103, 10.220966, 5.950183, 10.123631, 5.396471, 6.312868, 6.267350, 5.458517, 10.353681, 4.001779, 10.616924, 7.936855, 5.182608, 5.963908, 10.938264, 11.069152, 9.675570) Opps<- c(5144351, 6038044, 2354341, 4578272, 7197544, 474510, 1045528, 181471, 1071631, 801038, 928563, 477870, 590497, 849537, 410179, 432703, 1983993, 330478, 939806, 191824, 283107, 575004, 256846, 249530, 142318, 2036363) df<-data.frame(parent, child, Type, Size, Revenue, NumberofSales, Price, Opps)

This is what it looks like:

df parent child Type Size Revenue NumberofSales Price Opps 1 a A desktop MEDIUM 22138.16 1954720 11.325489 5144351 2 b B desktop MEDIUM 18617.94 5129937 3.629273 6038044 3 c C desktop LARGE 12394.36 1086104 11.411762 2354341 4 d D desktop LARGE 10535.76 970326 10.857959 4578272 5 e E desktop SMALL 8901.41 1608012 5.535661 7197544 6 f F desktop MEDIUM 7320.17 746613 9.804504 474510 7 g G desktop LARGE 3821.40 333424 11.461083 1045528 8 h H desktop SMALL 2811.50 236643 11.880766 181471 9 i I desktop MEDIUM 2483.10 352294 7.048374 1071631 10 j J desktop SMALL 2145.76 587541 3.652103 801038 11 k K desktop LARGE 2138.41 209218 10.220966 928563 12 l L desktop LARGE 2037.67 342455 5.950183 477870 13 m M desktop SMALL 1950.52 192670 10.123631 590497 14 n N desktop SMALL 1837.93 340580 5.396471 849537 15 o O desktop LARGE 1737.68 275260 6.312868 410179 16 p P desktop LARGE 1554.61 248049 6.267350 432703 17 q Q desktop MEDIUM 1374.40 251790 5.458517 1983993 18 r R desktop SMALL 1334.02 128845 10.353681 330478 19 s S desktop SMALL 1214.60 303515 4.001779 939806 20 t T desktop MEDIUM 1191.41 112218 10.616924 191824 21 u U desktop LARGE 1189.56 149878 7.936855 283107 22 v V desktop MEDIUM 1174.55 226633 5.182608 575004 23 w W desktop SMALL 1162.80 194973 5.963908 256846 24 x X desktop MEDIUM 1131.29 103425 10.938264 249530 25 y Y desktop LARGE 1127.05 101819 11.069152 142318 26 z Z desktop MEDIUM 1108.53 114570 9.675570 2036363

I want to create a dataframe that shows the distribution of Price BY Size and Type with all of the appropriate metrics for these Price ranges. I want the final dataframe to look like this. ( I didn't do the aggregation for the metric values because it takes way too long the way I am currently doing it, that's why they are all the same right now but the final answer should have all different values)

Type Size Price Range SUM_Opps SUM_NumberofSales SUM_Revenue 1 desktop LARGE $3-$3.99 9,143,587 2,531,983 $8,453.93 1 desktop LARGE $4-$4.99 9,143,587 2,531,983 $8,453.93 1 desktop LARGE $5-$5.99 9,143,587 2,531,983 $8,453.93 1 desktop LARGE $6-$6.99 9,143,587 2,531,983 $8,453.93 1 desktop LARGE $7-$7.99 9,143,587 2,531,983 $8,453.93 1 desktop LARGE $8-$8.99 9,143,587 2,531,983 $8,453.93 1 desktop LARGE $9-$9.99 9,143,587 2,531,983 $8,453.93 1 desktop LARGE $10-$10.99 9,143,587 2,531,983 $8,453.93 1 desktop LARGE $11-$11.99 9,143,587 2,531,983 $8,453.93 1 desktop LARGE $12-$12.99 9,143,587 2,531,983 $8,453.93 1 desktop LARGE $13-Greater 9,143,587 2,531,983 $8,453.93 1 desktop MEDIUM $3-$3.99 9,143,587 2,531,983 $8,453.93 1 desktop MEDIUM $4-$4.99 9,143,587 2,531,983 $8,453.93 1 desktop MEDIUM $5-$5.99 9,143,587 2,531,983 $8,453.93 1 desktop MEDIUM $6-$6.99 9,143,587 2,531,983 $8,453.93 1 desktop MEDIUM $7-$7.99 9,143,587 2,531,983 $8,453.93 1 desktop MEDIUM $8-$8.99 9,143,587 2,531,983 $8,453.93 1 desktop MEDIUM $9-$9.99 9,143,587 2,531,983 $8,453.93 1 desktop MEDIUM $10-$10.99 9,143,587 2,531,983 $8,453.93 1 desktop MEDIUM $11-$11.99 9,143,587 2,531,983 $8,453.93 1 desktop MEDIUM $12-$12.99 9,143,587 2,531,983 $8,453.93 1 desktop MEDIUM $13-Greater 9,143,587 2,531,983 $8,453.93 1 desktop SMALL $3-$3.99 9,143,587 2,531,983 $8,453.93 1 desktop SMALL $4-$4.99 9,143,587 2,531,983 $8,453.93 1 desktop SMALL $5-$5.99 9,143,587 2,531,983 $8,453.93 1 desktop SMALL $6-$6.99 9,143,587 2,531,983 $8,453.93 1 desktop SMALL $7-$7.99 9,143,587 2,531,983 $8,453.93 1 desktop SMALL $8-$8.99 9,143,587 2,531,983 $8,453.93 1 desktop SMALL $9-$9.99 9,143,587 2,531,983 $8,453.93 1 desktop SMALL $10-$10.99 9,143,587 2,531,983 $8,453.93 1 desktop SMALL $11-$11.99 9,143,587 2,531,983 $8,453.93 1 desktop SMALL $12-$12.99 9,143,587 2,531,983 $8,453.93 1 desktop SMALL $13-Greater 9,143,587 2,531,983 $8,453.93

How do I create the table above? The table above is showing the sum of OPPS, Number of Sales, and Revenue BY Type, Size, and Price Range.

I understand how to use dplyr to do the simple aggregation but the tough part is doing the distribution of prices.

Any help would be great, thanks!

解决方案

you could use Hmisc::cut2() to generate you price bins as levels of a factor:

library(Hmisc) library(dplyr) df$cut_Price <- cut2(df$Price, cuts = 4:13) df %>% group_by(cut_Price, Size, Type) %>% summarise_at(c("Opps", "NumberofSales", "Revenue"),"sum") %>% arrange(Size, cut_Price) %>% ungroup() %>% mutate(cut_Price = gsub("(.*, \\d\\.)00", "\\199", cut_Price)) # A tibble: 16 × 6 cut_Price Size Type Opps NumberofSales Revenue <chr> <fctr> <fctr> <dbl> <dbl> <dbl> 1 [ 5.00, 6.99) LARGE desktop 477870 342455 2037.67 2 [ 6.00, 7.99) LARGE desktop 842882 523309 3292.29 3 [ 7.00, 8.99) LARGE desktop 283107 149878 1189.56 4 [10.00,11.00) LARGE desktop 5506835 1179544 12674.17 5 [11.00,12.00) LARGE desktop 3542187 1521347 17342.81 6 [ 3.63, 4.99) MEDIUM desktop 6038044 5129937 18617.94 7 [ 5.00, 6.99) MEDIUM desktop 2558997 478423 2548.95 8 [ 7.00, 8.99) MEDIUM desktop 1071631 352294 2483.10 9 [ 9.00,10.00) MEDIUM desktop 2510873 861183 8428.70 10 [10.00,11.00) MEDIUM desktop 441354 215643 2322.70 11 [11.00,12.00) MEDIUM desktop 5144351 1954720 22138.16 12 [ 3.63, 4.99) SMALL desktop 801038 587541 2145.76 13 [ 4.00, 5.99) SMALL desktop 939806 303515 1214.60 14 [ 5.00, 6.99) SMALL desktop 8303927 2143565 11902.14 15 [10.00,11.00) SMALL desktop 920975 321515 3284.54 16 [11.00,12.00) SMALL desktop 181471 236643 2811.50

if you want to adjust the cuts to every 0.5 instead of 1, you could do this since its the vector passed to cut = ... is defining the "cut points":

df$cut_Price <- cut2(df$Price, cuts = seq(4,13,.5))

更多推荐

按组进行数据帧聚合

本文发布于:2023-10-28 23:12:24,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/1538041.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:数据

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!