pandas: check whether a timestamp falls within a resampled 30-minute DatetimeIndex bin

Problem description

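I have created a resampled DataFrame (DF1) in pandas with a DatetimeIndex. I have a separate DataFrame (DF2) with a DatetimeIndex and a time column. If a time value from DF2 falls within one of the 30-minute bins of DF1's DatetimeIndex, I want to tag that row of DF2 with the corresponding speed from the DF1 bin.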

DF1

                           boat_id     speed
    time
    2015-01-13 09:00:00  28.000000  0.000000
    2015-01-13 09:30:00  28.000000  0.723503
    2015-01-13 10:00:00  28.000000  2.239399

DF2

                             id  boat_id                 time  state
    time
    2015-01-18 16:09:03  319437       28  2015-01-18 16:09:03      2
    2015-01-18 16:18:43  319451       28  2015-01-18 16:18:43      0
    2015-03-01 09:39:51  507108       31  2015-03-01 09:39:51      1
    2015-03-01 09:40:58  507109       31  2015-03-01 09:40:58      0

Desired result

                             id  boat_id                 time  state     speed
    time
    2015-01-18 16:09:03  319437       28  2015-01-18 16:09:03      2       nan
    2015-01-18 16:18:43  319451       28  2015-01-18 16:18:43      0       nan
    2015-03-01 09:39:51  507108       31  2015-03-01 09:39:51      1  2.239399
    2015-03-01 09:40:58  507109       31  2015-03-01 09:40:58      0  2.239399

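I wrote this script to try to do this, but I think it fails because the DatetimeIndex of DF1 is immutable, so my timedelta expression doesn't create a start point for the bin. One thought was to copy the DatetimeIndex of DF1 into a new column where the values are mutable, but I haven't managed that yet, so I'm not 100% sure of the logic. I'm happy to tinker, but I've been stalled for a while and was hoping someone might have a few ideas. Thanks in advance.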

    for row in DF1.iterrows():
        for dfrow in DF2.iterrows():
            if dfrow[0] > row[0] - dt.timedelta(minutes=30) and dfrow[0] < row[0]:
                df['test'] = row[1]

Solution

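Iterating row by row is very slow; a vectorized solution is better. I use the merge function twice (see the pandas documentation for merge).

Input: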

    print(df1)
                         boat_id     speed
    time
    2015-03-01 09:00:00       28  0.000000
    2015-03-01 09:30:00       28  0.723503
    2015-03-01 10:00:00       28  2.239399

    print(df2)
                             id  boat_id                 time  state
    time
    2015-01-18 16:09:03  319437       28  2015-01-18 16:09:03      2
    2015-01-18 16:18:43  319451       28  2015-01-18 16:18:43      0
    2015-03-01 09:39:51  507108       31  2015-03-01 09:39:51      1
    2015-03-01 09:40:58  507109       31  2015-03-01 09:40:58      0

I reset the index of both DataFrames and create a helper column i filled with 1.

    # move the bin labels into a regular 'time' column; df2 already has one
    df1 = df1.reset_index()
    df2 = df2.reset_index(drop=True)
    # constant key so that every df2 row can be paired with every df1 row
    df1['i'] = df2['i'] = 1

    print(df1)
                      time  boat_id     speed  i
    0  2015-03-01 09:00:00       28  0.000000  1
    1  2015-03-01 09:30:00       28  0.723503  1
    2  2015-03-01 10:00:00       28  2.239399  1

    print(df2)
           id  boat_id                 time  state  i
    0  319437       28  2015-01-18 16:09:03      2  1
    1  319451       28  2015-01-18 16:18:43      0  1
    2  507108       31  2015-03-01 09:39:51      1  1
    3  507109       31  2015-03-01 09:40:58      0  1

Then I merge both DataFrames on the helper column i, which pairs every row of df2 with every bin of df1.

    df = df2.merge(df1, on='i', how='left')
    df = df.rename(columns={'time_y': 'Bin_time', 'time_x': 'time'})

    print(df)
            id  boat_id_x                 time  state  i             Bin_time  boat_id_y     speed
    0   319437         28  2015-01-18 16:09:03      2  1  2015-03-01 09:00:00         28  0.000000
    1   319437         28  2015-01-18 16:09:03      2  1  2015-03-01 09:30:00         28  0.723503
    2   319437         28  2015-01-18 16:09:03      2  1  2015-03-01 10:00:00         28  2.239399
    3   319451         28  2015-01-18 16:18:43      0  1  2015-03-01 09:00:00         28  0.000000
    4   319451         28  2015-01-18 16:18:43      0  1  2015-03-01 09:30:00         28  0.723503
    5   319451         28  2015-01-18 16:18:43      0  1  2015-03-01 10:00:00         28  2.239399
    6   507108         31  2015-03-01 09:39:51      1  1  2015-03-01 09:00:00         28  0.000000
    7   507108         31  2015-03-01 09:39:51      1  1  2015-03-01 09:30:00         28  0.723503
    8   507108         31  2015-03-01 09:39:51      1  1  2015-03-01 10:00:00         28  2.239399
    9   507109         31  2015-03-01 09:40:58      0  1  2015-03-01 09:00:00         28  0.000000
    10  507109         31  2015-03-01 09:40:58      0  1  2015-03-01 09:30:00         28  0.723503
    11  507109         31  2015-03-01 09:40:58      0  1  2015-03-01 10:00:00         28  2.239399
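As an aside, on pandas 1.2 or later the helper column should not be needed, because merge accepts how='cross' for a Cartesian product. The sketch below is not part of the original answer: it rebuilds the example frames by hand (the construction code and the name pairs are my own assumptions) and should produce the same 12 pairings.

```python
import pandas as pd

# Hand-built copies of the example frames (assumed construction, indexes already reset).
df1 = pd.DataFrame({
    "time": pd.to_datetime(["2015-03-01 09:00:00",
                            "2015-03-01 09:30:00",
                            "2015-03-01 10:00:00"]),
    "boat_id": [28, 28, 28],
    "speed": [0.000000, 0.723503, 2.239399],
})
df2 = pd.DataFrame({
    "id": [319437, 319451, 507108, 507109],
    "boat_id": [28, 28, 31, 31],
    "time": pd.to_datetime(["2015-01-18 16:09:03",
                            "2015-01-18 16:18:43",
                            "2015-03-01 09:39:51",
                            "2015-03-01 09:40:58"]),
    "state": [2, 0, 1, 0],
})

# Cartesian product: every df2 row is paired with every df1 bin.
# Overlapping column names get the default suffixes _x (left) and _y (right).
pairs = df2.merge(df1, how="cross")
pairs = pairs.rename(columns={"time_y": "Bin_time", "time_x": "time"})

print(pairs.shape)
```

The only difference from the helper-column version is that no i column is created, so it does not have to be dropped later.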

The output is then filtered by bin time:

    import datetime as dt

    # keep only pairs whose time lies in the 30 minutes up to and including Bin_time
    df = df[(df.time >= df.Bin_time - dt.timedelta(minutes=30)) & (df.time <= df.Bin_time)]
    df = df.drop(['Bin_time', 'id', 'boat_id_x', 'boat_id_y', 'state', 'i'], axis=1)

    print(df)
                       time     speed
    8   2015-03-01 09:39:51  2.239399
    11  2015-03-01 09:40:58  2.239399
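To see what this filter keeps, here is a small sketch with a single timestamp paired against all three bins. The frame pairs and the variable window are hypothetical names used only for illustration, and pd.Timedelta stands in for dt.timedelta (both work in this comparison). As in the filter above, a bin label is treated as the end of its 30-minute window.

```python
import pandas as pd

# One df2 timestamp paired with each of the three bin labels (illustrative data).
pairs = pd.DataFrame({
    "time": pd.to_datetime(["2015-03-01 09:39:51"] * 3),
    "Bin_time": pd.to_datetime(["2015-03-01 09:00:00",
                                "2015-03-01 09:30:00",
                                "2015-03-01 10:00:00"]),
    "speed": [0.000000, 0.723503, 2.239399],
})

window = pd.Timedelta(minutes=30)

# True where Bin_time - 30 min <= time <= Bin_time (both ends inclusive).
in_bin = (pairs["time"] >= pairs["Bin_time"] - window) & (pairs["time"] <= pairs["Bin_time"])
matched = pairs.loc[in_bin, ["time", "speed"]]

print(matched)
```

Only the 10:00:00 bin survives, because 09:39:51 lies between 09:30:00 and 10:00:00. Since both bounds are inclusive, a timestamp that sits exactly on a bin label would match two bins; making one of the comparisons strict avoids that.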

Finally, df is merged back into df2 on the time column, so rows without a matching bin get NaN.

    df = df2.merge(df, on='time', how='left').reset_index(drop=True)
    df = df.drop(['i'], axis=1)

    print(df)
           id  boat_id                 time  state     speed
    0  319437       28  2015-01-18 16:09:03      2       NaN
    1  319451       28  2015-01-18 16:18:43      0       NaN
    2  507108       31  2015-03-01 09:39:51      1  2.239399
    3  507109       31  2015-03-01 09:40:58      0  2.239399
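For larger inputs the full cross product can get expensive. A different technique, not used in the original answer, is pd.merge_asof with direction='forward' and a 30-minute tolerance, which looks up the first bin label at or after each timestamp. The sketch below assumes the same example data, rebuilt by hand, and assumes both frames can be sorted by time; the names bins and result are mine.

```python
import math

import pandas as pd

# Bin labels and speeds from df1 (hand-built, assumed to match the example).
bins = pd.DataFrame({
    "Bin_time": pd.to_datetime(["2015-03-01 09:00:00",
                                "2015-03-01 09:30:00",
                                "2015-03-01 10:00:00"]),
    "speed": [0.000000, 0.723503, 2.239399],
})
df2 = pd.DataFrame({
    "id": [319437, 319451, 507108, 507109],
    "boat_id": [28, 28, 31, 31],
    "time": pd.to_datetime(["2015-01-18 16:09:03",
                            "2015-01-18 16:18:43",
                            "2015-03-01 09:39:51",
                            "2015-03-01 09:40:58"]),
    "state": [2, 0, 1, 0],
})

# merge_asof requires both key columns to be sorted.
result = pd.merge_asof(
    df2.sort_values("time"),
    bins.sort_values("Bin_time"),
    left_on="time",
    right_on="Bin_time",
    direction="forward",                 # first Bin_time >= time
    tolerance=pd.Timedelta(minutes=30),  # but at most 30 minutes ahead
)

print(result[["id", "time", "speed"]])
```

Unlike the merge-and-filter approach, this returns at most one bin per row, so a timestamp exactly on a bin label is assigned only to that bin. It also ignores boat_id, as the original answer does; if bins should be matched per boat, merge_asof has a by parameter for that.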

A comparison of the vectorized and indexing approaches can be found in a similar answer.

