Creating DataFrame slices by group

Updated: 2024-10-25 16:26:45
This article walks through one approach to creating DataFrame slices by group; it may be a useful reference for anyone facing the same problem.

Problem description

I have a DataFrame with 3 columns - location_id, customers, cluster. Previously, I clustered my data into 5 clusters, so the cluster column contains the values [0, 1, 2, 3, 4].

I would like to separate each cluster into 2 slices for my next stage of testing, e.g. a 50-50 split, a 30-70 split, or a 20-80 split.

Question - How do I apply a function that adds a column to data.groupby('cluster')?

Ideal Result

   location_id  customers  cluster  slice
0       149213     132817        1      1
1       578371      76655        1      0
2        91703      74048        2      1
3       154868      62397        2      1
4      1022759      59162        2      0

Update

@MaxU's solution put me on the right path. It uses the DataFrame.assign function to add a new column, then compares each row's position within its group against the group length to assign slices of the correct proportions. However, the code below somehow did not work for me, so I ended up splitting @MaxU's solution into separate steps, and that worked.

testgroup = (data.groupby('cluster')
             .apply(lambda x: x.assign(index1=np.arange(len(x)))))
testgroup = (testgroup.groupby('cluster')
             .apply(lambda x: x.assign(total_len=len(x))))
testgroup['is_slice'] = (testgroup['index1'] / testgroup['total_len']) <= 0.5

    location_id  customers  cluster  index1  total_len  is_slice
0        149213     132817        1       0         12      True
1        578371      76655        1       1         12      True
2         91703      74048        1       2         12      True
3        154868      62397        1       3         12      True
4       1022759      59162        1       4         12      True
5         87016      58134        1       5         12      True
6        649432      56849        1       6         12     False
7        219163      56802        1       7         12     False
8         97704      54718        1       8         12     False
9        248455      52806        1       9         12     False
10       184828      52783        1      10         12     False
11       152887      52565        1      11         12     False
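One detail worth noting about the stepwise code above (a side observation, not from the original post): in Python 3, the `<= 0.5` comparison also keeps the row whose ratio is exactly 0.5, so a 12-row group splits 7-5 rather than 6-6. A strict `<` gives the exact half split:

```python
import numpy as np

n = 12                        # rows in one cluster
ratios = np.arange(n) / n     # 0/12, 1/12, ..., 11/12

# <= also includes index 6 (6/12 == 0.5 exactly): 7 rows in the first slice
print((ratios <= 0.5).sum())  # 7
# strict < yields the exact 50-50 split
print((ratios < 0.5).sum())   # 6
```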

Solution

Try this:

Let's make your sample DF a bit larger:

In [31]: df = pd.concat([df] * 3, ignore_index=True)

In [32]: df
Out[32]:
    location_id  customers  cluster
0        149213     132817        1
1        578371      76655        1
2         91703      74048        2
3        154868      62397        2
4       1022759      59162        2
5        149213     132817        1
6        578371      76655        1
7         91703      74048        2
8        154868      62397        2
9       1022759      59162        2
10       149213     132817        1
11       578371      76655        1
12        91703      74048        2
13       154868      62397        2
14      1022759      59162        2

slice 30-70:

In [34]: (df.groupby('cluster')
    ...:    .apply(lambda x: x.assign(slice=((np.arange(len(x)) / len(x)) <= 0.3).astype(np.uint8)))
    ...:    .reset_index(level=0, drop=True)
    ...: )
Out[34]:
    location_id  customers  cluster  slice
0        149213     132817        1      1
1        578371      76655        1      1
5        149213     132817        1      0
6        578371      76655        1      0
10       149213     132817        1      0
11       578371      76655        1      0
2         91703      74048        2      1
3        154868      62397        2      1
4       1022759      59162        2      1
7         91703      74048        2      0
8        154868      62397        2      0
9       1022759      59162        2      0
12        91703      74048        2      0
13       154868      62397        2      0
14      1022759      59162        2      0

slice 20-80:

In [35]: (df.groupby('cluster')
    ...:    .apply(lambda x: x.assign(slice=((np.arange(len(x)) / len(x)) <= 0.2).astype(np.uint8)))
    ...:    .reset_index(level=0, drop=True)
    ...: )
Out[35]:
    location_id  customers  cluster  slice
0        149213     132817        1      1
1        578371      76655        1      1
5        149213     132817        1      0
6        578371      76655        1      0
10       149213     132817        1      0
11       578371      76655        1      0
2         91703      74048        2      1
3        154868      62397        2      1
4       1022759      59162        2      0
7         91703      74048        2      0
8        154868      62397        2      0
9       1022759      59162        2      0
12        91703      74048        2      0
13       154868      62397        2      0
14      1022759      59162        2      0
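The same labeling can also be done without groupby().apply(), using cumcount and transform. The sketch below (a hypothetical reworking, not part of the original answer; the helper name and `frac` parameter are my own) wraps the split fraction in a small function:

```python
import pandas as pd

def slice_by_group(df, group_col, frac):
    """Label roughly the first `frac` of each group with 1, the rest with 0.

    Hypothetical helper, not from the original answer; rows are taken in
    their current order within each group.
    """
    pos = df.groupby(group_col).cumcount()                     # 0..n-1 within each group
    size = df.groupby(group_col)[group_col].transform("size")  # rows per group
    return (pos < frac * size).astype("uint8")

# The same 15-row frame as in the answer (the 5-row sample repeated 3 times)
base = pd.DataFrame({
    "location_id": [149213, 578371, 91703, 154868, 1022759],
    "customers":   [132817, 76655, 74048, 62397, 59162],
    "cluster":     [1, 1, 2, 2, 2],
})
df = pd.concat([base] * 3, ignore_index=True)

df["slice"] = slice_by_group(df, "cluster", 0.3)
print(df)  # rows 0-4 get slice 1, all later repeats get 0
```

Because cumcount and transform are vectorized per group, this avoids the per-group lambda and the reset_index step entirely.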


This article was published on 2023-11-28 15:44:23.