Grouping a large pandas time-series DataFrame by a specific time interval

Updated: 2024-10-22 18:33:13

Problem description

I have a large CSV file with timestamp data in ISO format, e.g. 2015-04-01 10:26:41. The data spans several months, with entries anywhere from 30 seconds to several hours apart. Its columns are id, time, speed.

Ultimately I want to group the data into 15-minute intervals and then calculate the average speed over however many entries fall in each 15-minute slot.

I am trying to use pandas because it seems to have solid time-series tools and this might be easy to do, but I am falling at the first hurdle.

So far I have imported the CSV as a DataFrame, and all columns have a dtype of object. I have sorted the data by date and am now trying to group the entries by a time interval, which is where I'm struggling. Based on some Google searching, I tried to resample the data with df.resample('5min', how=sum). That gives the error TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex. I was also thinking about trying the groupby method, perhaps with a lambda as in df.groupby(lambda x:x.minutes + 5), but that produces the error AttributeError: 'str' object has no attribute 'minutes'.

Basically I'm a little confused as to a) whether pandas has the time-series data in a format it recognizes, given that its dtype is object, and b) if it does recognize it, how to get the time intervals to work.

Keen to learn if anyone could point me in the right direction.

The DataFrame looks like this:

        0        1                    2          3
0      id  boat_id                 time      speed
1  386226       32  2015-01-15 05:14:32  4.2343243
2  386285       32  2015-01-15 05:44:57    3.45234

Recommended answer

First, it looks like you read a blank row. You probably want to skip the first row in your file: pd.read_csv(filename, skiprows=1).
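As a minimal sketch of that first step, the snippet below reads a small in-memory stand-in for the file. The sample contents (an unwanted first line followed by the real header) are an assumption made for illustration, not the asker's actual file.

```python
import io

import pandas as pd

# Hypothetical file contents: one unwanted first line, then the real
# header and two data rows taken from the question.
raw = (
    "0,1,2,3\n"
    "id,boat_id,time,speed\n"
    "386226,32,2015-01-15 05:14:32,4.2343243\n"
    "386285,32,2015-01-15 05:44:57,3.45234\n"
)

# skiprows=1 drops the first line, so the second line becomes the header.
df = pd.read_csv(io.StringIO(raw), skiprows=1)

print(df.columns.tolist())
print(df.dtypes)
```

With the header in the right place the numeric columns get numeric dtypes, but the time column is still plain text at this point.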

You should then convert the text representation of the time into a DatetimeIndex using pd.to_datetime() and set it as the DataFrame's index:

df.set_index(pd.to_datetime(df['time']), inplace=True)
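A quick way to confirm the conversion worked is to check the index type afterwards. The small DataFrame below is made up from the two rows shown in the question, and the sketch uses the assignment form of set_index rather than inplace=True, which should be equivalent here.

```python
import pandas as pd

# Sample rows from the question; the time column starts out as text.
df = pd.DataFrame({
    "id": [386226, 386285],
    "boat_id": [32, 32],
    "time": ["2015-01-15 05:14:32", "2015-01-15 05:44:57"],
    "speed": [4.2343243, 3.45234],
})

# Parse the text timestamps and use the result as the index.
df = df.set_index(pd.to_datetime(df["time"]))

# The index is now a DatetimeIndex, which is what resample() requires.
print(type(df.index).__name__)
```

Note that the original text column time is still present alongside the new index; it can be dropped with df.drop(columns="time") if it is no longer needed.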

With a DatetimeIndex in place, resample should work.
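For the 15-minute average speed the question asks about, a sketch along these lines may help. The sample values are invented for illustration. In current pandas versions the aggregation is chained as a method (.mean()) rather than passed through the how argument used in the question.

```python
import pandas as pd

# Invented sample data: four readings spread over three 15-minute slots.
df = pd.DataFrame({
    "time": [
        "2015-01-15 05:14:32",
        "2015-01-15 05:16:10",
        "2015-01-15 05:20:45",
        "2015-01-15 05:44:57",
    ],
    "speed": [4.0, 2.0, 4.0, 3.0],
})
df = df.set_index(pd.to_datetime(df["time"]))

# Bucket the rows into 15-minute bins and average the speed in each bin.
avg_speed = df["speed"].resample("15min").mean()

print(avg_speed)
```

Because the real data has gaps of several hours, some 15-minute slots will contain no rows and show up as NaN; calling .dropna() on the result removes them if they are not wanted.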