Creating DataFrame slices by group

Updated: 2024-10-25 16:26:45
This article walks through one approach to creating DataFrame slices by group; it may be a useful reference for anyone facing the same problem.

Problem description

I have a DataFrame with 3 columns - location_id, customers, cluster. Previously, I clustered my data into 5 clusters, so the cluster column contains the values [0, 1, 2, 3, 4].

I would like to separate each cluster into 2 slices for my next stage of testing, e.g. a 50-50 split, a 30-70 split, or a 20-80 split.

Question - How do I apply a function that adds a column to data.groupby('cluster')?

Ideal Result

   location_id  customers  cluster  slice
0       149213     132817        1      1
1       578371      76655        1      0
2        91703      74048        2      1
3       154868      62397        2      1
4      1022759      59162        2      0

Update

@MaxU's solution put me on the right path. It uses the DataFrame.assign function to add a new column, then compares each row's position within its group against the group length to assign slices of the correct proportions. However, the code below somehow did not work for me, so I ended up splitting @MaxU's solution into separate steps, and that worked.

testgroup = (data.groupby('cluster')
             .apply(lambda x: x.assign(index1=np.arange(len(x)))))
testgroup = (testgroup.groupby('cluster')
             .apply(lambda x: x.assign(total_len=len(x))))
testgroup['is_slice'] = (testgroup['index1'] / testgroup['total_len']) <= 0.5

    location_id  customers  cluster  index1  total_len  is_slice
0        149213     132817        1       0         12      True
1        578371      76655        1       1         12      True
2         91703      74048        1       2         12      True
3        154868      62397        1       3         12      True
4       1022759      59162        1       4         12      True
5         87016      58134        1       5         12      True
6        649432      56849        1       6         12     False
7        219163      56802        1       7         12     False
8         97704      54718        1       8         12     False
9        248455      52806        1       9         12     False
10       184828      52783        1      10         12     False
11       152887      52565        1      11         12     False
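One detail worth noting about the stepwise code above (a side observation, not from the original post): in Python 3, the `<= 0.5` comparison also keeps the row whose ratio is exactly 0.5, so a 12-row group splits 7-5 rather than 6-6. A strict `<` gives the exact half split:

```python
import numpy as np

n = 12                        # rows in one cluster
ratios = np.arange(n) / n     # 0/12, 1/12, ..., 11/12

# <= also includes index 6 (6/12 == 0.5 exactly): 7 rows in the first slice
print((ratios <= 0.5).sum())  # 7
# strict < yields the exact 50-50 split
print((ratios < 0.5).sum())   # 6
```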

Solution

Try this:

Let's make your sample DF a bit larger:

In [31]: df = pd.concat([df] * 3, ignore_index=True)

In [32]: df
Out[32]:
    location_id  customers  cluster
0        149213     132817        1
1        578371      76655        1
2         91703      74048        2
3        154868      62397        2
4       1022759      59162        2
5        149213     132817        1
6        578371      76655        1
7         91703      74048        2
8        154868      62397        2
9       1022759      59162        2
10       149213     132817        1
11       578371      76655        1
12        91703      74048        2
13       154868      62397        2
14      1022759      59162        2

slice 30-70:

In [34]: (df.groupby('cluster')
    ...:    .apply(lambda x: x.assign(slice=((np.arange(len(x)) / len(x)) <= 0.3).astype(np.uint8)))
    ...:    .reset_index(level=0, drop=True)
    ...: )
Out[34]:
    location_id  customers  cluster  slice
0        149213     132817        1      1
1        578371      76655        1      1
5        149213     132817        1      0
6        578371      76655        1      0
10       149213     132817        1      0
11       578371      76655        1      0
2         91703      74048        2      1
3        154868      62397        2      1
4       1022759      59162        2      1
7         91703      74048        2      0
8        154868      62397        2      0
9       1022759      59162        2      0
12        91703      74048        2      0
13       154868      62397        2      0
14      1022759      59162        2      0

slice 20-80:

In [35]: (df.groupby('cluster')
    ...:    .apply(lambda x: x.assign(slice=((np.arange(len(x)) / len(x)) <= 0.2).astype(np.uint8)))
    ...:    .reset_index(level=0, drop=True)
    ...: )
Out[35]:
    location_id  customers  cluster  slice
0        149213     132817        1      1
1        578371      76655        1      1
5        149213     132817        1      0
6        578371      76655        1      0
10       149213     132817        1      0
11       578371      76655        1      0
2         91703      74048        2      1
3        154868      62397        2      1
4       1022759      59162        2      0
7         91703      74048        2      0
8        154868      62397        2      0
9       1022759      59162        2      0
12        91703      74048        2      0
13       154868      62397        2      0
14      1022759      59162        2      0
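The same labeling can also be done without groupby().apply(), using cumcount and transform. The sketch below (a hypothetical reworking, not part of the original answer; the helper name and `frac` parameter are my own) wraps the split fraction in a small function:

```python
import pandas as pd

def slice_by_group(df, group_col, frac):
    """Label roughly the first `frac` of each group with 1, the rest with 0.

    Hypothetical helper, not from the original answer; rows are taken in
    their current order within each group.
    """
    pos = df.groupby(group_col).cumcount()                     # 0..n-1 within each group
    size = df.groupby(group_col)[group_col].transform("size")  # rows per group
    return (pos < frac * size).astype("uint8")

# The same 15-row frame as in the answer (the 5-row sample repeated 3 times)
base = pd.DataFrame({
    "location_id": [149213, 578371, 91703, 154868, 1022759],
    "customers":   [132817, 76655, 74048, 62397, 59162],
    "cluster":     [1, 1, 2, 2, 2],
})
df = pd.concat([base] * 3, ignore_index=True)

df["slice"] = slice_by_group(df, "cluster", 0.3)
print(df)  # rows 0-4 get slice 1, all later repeats get 0
```

Because cumcount and transform are vectorized per group, this avoids the per-group lambda and the reset_index step entirely.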


This article was published on 2023-11-28 15:44:23.