Python Pandas data binning

Apologies for any lack of clarity; I am new to posting here.

I am working with pandas on a data frame, and I would appreciate any input from the community. The data frame looks like this:

[Screenshot of the data frame structure]

My goal is to create a separate data frame that averages x and y into bins, based on T increments of 5.

For example, for T in the range 0-5, average the corresponding x and y values into x1bin and y1bin; for T in the range 5-10, average them into x2bin and y2bin; for T in the range 10-15, average them into x3bin and y3bin; and so on, all the way up to T in the range 135-140. At the same time, the data should stay indexed by ID, meaning that data belonging to one ID is kept with that ID. As you might have noticed, some bins will contain NaN values because there are no corresponding T values, and that is fine.

Finally, it might be helpful to know how I calculated T: it is the running time of column A per ID, and it starts from 0 with every new ID.

df['T'] = df.groupby(['ID']).A.apply(lambda x: x - x.iloc[0]) / np.timedelta64(1, 'm')
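For readers who want to reproduce T, here is a minimal sketch on made-up data. The column names ID and A come from the question; the timestamps are invented for illustration. As an assumption on my part, it subtracts each ID's first timestamp via transform("first") rather than apply, because that result is always aligned with the original index.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two IDs, each with its own timestamps in column A.
df = pd.DataFrame({
    "ID": ["a", "a", "a", "b", "b"],
    "A": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 00:03", "2020-01-01 00:12",
        "2020-01-01 05:00", "2020-01-01 05:07",
    ]),
})

# First timestamp of each ID, broadcast back to every row of that ID.
first_per_id = df.groupby("ID")["A"].transform("first")

# Elapsed time since the first row of each ID, expressed in minutes.
df["T"] = (df["A"] - first_per_id) / np.timedelta64(1, "m")
print(df)
```

With this toy data, T restarts at 0 for each ID, which is the behaviour described in the question.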

Thank you in advance.

Best answer

Assuming df is your data frame:

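import numpy as np
import pandas as pd

t_range = 5  # bin width, in the same units as T
t_ranges = np.arange(0, df['T'].max() + 1, t_range)  # bin edges 0, 5, 10, ...

rows = []
for i in range(1, len(t_ranges)):
    # select rows with t_ranges[i-1] <= T < t_ranges[i]
    a = df[(df['T'] >= t_ranges[i - 1]) & (df['T'] < t_ranges[i])]
    x_avg = a['X'].mean()
    y_avg = a['Y'].mean()
    rows.append({'t_range': t_ranges[i], 'x_avg': x_avg, 'y_avg': y_avg})

# collect the rows at the end (DataFrame.append was removed in pandas 2.0)
new_df = pd.DataFrame(rows, columns=['t_range', 'x_avg', 'y_avg'])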
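The loop above averages over all rows regardless of ID, while the question also asks to keep each ID separate. One possible alternative, sketched here under assumptions and not the answerer's method, is to let pd.cut assign each T to a left-closed bin and then average X and Y per ID and bin with groupby. The column names follow the question and answer, the 0 to 140 range is taken from the question, and the data values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the columns named in the question.
df = pd.DataFrame({
    "ID": ["a", "a", "a", "a", "b", "b", "b"],
    "T":  [0.0, 2.5, 6.0, 12.0, 0.0, 4.0, 11.0],
    "X":  [1.0, 3.0, 5.0, 7.0, 10.0, 20.0, 30.0],
    "Y":  [2.0, 4.0, 6.0, 8.0, 1.0, 3.0, 5.0],
})

t_range = 5
edges = np.arange(0, 140 + t_range, t_range)  # 0, 5, ..., 140

# right=False gives left-closed bins [0, 5), [5, 10), ...
df["t_bin"] = pd.cut(df["T"], bins=edges, right=False)

# observed=False keeps bins without data, which show up as NaN averages.
binned = df.groupby(["ID", "t_bin"], observed=False)[["X", "Y"]].mean()
print(binned.head(6))
```

The result has one row per ID and bin, so empty bins appear as NaN, which matches what the question says is acceptable.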

Sample data df:

   T  X  Y
0  1  2  3
1  2  3  4
2  3  4  5
3  4  5  6
4  5  6  7
5  6  7  8
6  7  8  9
7  8  9  1

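Using t_range = 2, i.e. the bins 0-2, 2-4, 4-6 and 6-8, where each bin includes its lower edge and excludes its upper edge, the sample output new_df is:

   t_range  x_avg  y_avg
0      2.0    2.0    3.0
1      4.0    3.5    4.5
2      6.0    5.5    6.5
3      8.0    7.5    8.5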

EDIT: change x_avg and y_avg as shown below so that any NaNs present in the data are dropped before averaging (t_range = 2 is used in the example that follows):

x_avg = a['X'].dropna().mean()
y_avg = a['Y'].dropna().mean()
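As a quick check of how NaN is handled, here is a small sketch with made-up values. It suggests that both forms give the same average, because Series.mean() skips NaN by default (skipna=True), so the dropna() call mainly makes the intent explicit.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 14.0])  # made-up values with one missing entry

default_mean = s.mean()               # NaN is skipped by default
dropped_mean = s.dropna().mean()      # NaN dropped explicitly, same result
strict_mean = s.mean(skipna=False)    # NaN propagates when skipping is disabled

print(default_mean, dropped_mean, strict_mean)
```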

Sample data:

   T     X     Y
0  1   2.0   6.0
---------------- t<2
1  2   NaN   NaN
2  3   7.0   8.0
---------------- t<4 and t>=2
3  4  10.0  11.0
4  5   NaN  14.0
---------------- t<6 and t>=4
5  6   1.0   NaN
6  7  12.0  13.0
---------------- t<8 and t>=6
7  8   1.0   2.0

Output:

   t_range  x_avg  y_avg
0      2.0    2.0    6.0
1      4.0    7.0    8.0
2      6.0   10.0   12.5
3      8.0    6.5   13.0
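To make the NaN example reproducible, here is a self-contained sketch that rebuilds the sample data from the table above and applies the same binning with t_range = 2. It collects the rows in a list instead of calling DataFrame.append, which is my own substitution, since append is no longer available in pandas 2.0 and later.

```python
import numpy as np
import pandas as pd

# The sample data from the table above.
df = pd.DataFrame({
    "T": [1, 2, 3, 4, 5, 6, 7, 8],
    "X": [2.0, np.nan, 7.0, 10.0, np.nan, 1.0, 12.0, 1.0],
    "Y": [6.0, np.nan, 8.0, 11.0, 14.0, np.nan, 13.0, 2.0],
})

t_range = 2
t_ranges = np.arange(0, df["T"].max() + 1, t_range)  # edges 0, 2, 4, 6, 8

rows = []
for lo, hi in zip(t_ranges[:-1], t_ranges[1:]):
    a = df[(df["T"] >= lo) & (df["T"] < hi)]  # rows with lo <= T < hi
    rows.append({
        "t_range": float(hi),
        "x_avg": a["X"].dropna().mean(),
        "y_avg": a["Y"].dropna().mean(),
    })

new_df = pd.DataFrame(rows)
print(new_df)
```

Note that the row with T = 8 lies outside the last bin (6 <= T < 8), so it does not contribute to any average.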

I hope this answers your second query in the comments. If you found the answer helpful, please mark it as accepted by clicking the tick mark next to this answer.
