Consider you've got some unevenly spaced time series data:
```python
import pandas as pd
import random as randy

ts = pd.Series(range(1000),
               index=randy.sample(pd.date_range('2013-02-01 09:00:00.000000',
                                                periods=1e6, freq='U'), 1000)).sort_index()
print ts.head()

2013-02-01 09:00:00.002895    995
2013-02-01 09:00:00.003765    499
2013-02-01 09:00:00.003838    797
2013-02-01 09:00:00.004727    295
2013-02-01 09:00:00.006287    253
```
Let's say I wanted to do the rolling sum over a 1ms window to get this:
```
2013-02-01 09:00:00.002895    995
2013-02-01 09:00:00.003765    499 + 995
2013-02-01 09:00:00.003838    797 + 499 + 995
2013-02-01 09:00:00.004727    295 + 797 + 499
2013-02-01 09:00:00.006287    253
```
Currently, I cast everything back to longs and do this in Cython, but is this possible in pure pandas? I'm aware that you can do something like .asfreq('U') and then fill and use the traditional functions, but this doesn't scale once you've got more than a toy number of rows.
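(For what it's worth, pandas later grew native offset-based windows, `Series.rolling` with a time string, from 0.19 onward, which answers the "pure pandas" question directly. A minimal sketch on made-up microsecond data; the default window is `(t - 1ms, t]`, closed on the right, which matches the desired output above:)

```python
import pandas as pd

# toy data with microsecond timestamps (values and times invented for illustration)
ts = pd.Series([1.0, 2.0, 3.0, 4.0],
               index=pd.to_datetime(['2013-02-01 09:00:00.000100',
                                     '2013-02-01 09:00:00.000500',
                                     '2013-02-01 09:00:00.001400',
                                     '2013-02-01 09:00:00.002600']))

# offset-based window: at each point, sum the values in (t - 1ms, t]
res = ts.rolling('1ms').sum()
# res.tolist() == [1.0, 3.0, 5.0, 4.0]
```

The `closed` parameter of `rolling` controls whether the window endpoints are included, if you need `[t - 1ms, t]` semantics instead.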
As a point of reference, here's a hackish, not fast Cython version:
```python
%%cython
import numpy as np
cimport cython
cimport numpy as np

ctypedef np.double_t DTYPE_t

def rolling_sum_cython(np.ndarray[long, ndim=1] times,
                       np.ndarray[double, ndim=1] to_add,
                       long window_size):
    cdef long t_len = times.shape[0], s_len = to_add.shape[0], i = 0, win_size = window_size, j, window_start
    cdef np.ndarray[DTYPE_t, ndim=1] res = np.zeros(t_len, dtype=np.double)
    assert(t_len == s_len)
    for i in range(0, t_len):
        window_start = times[i] - win_size
        j = i
        # check j >= 0 before indexing so we never read times[-1]
        while j >= 0 and times[j] >= window_start:
            res[i] += to_add[j]
            j -= 1
    return res
```
Demonstrating this on a slightly larger series:
```python
ts = pd.Series(range(100000),
               index=randy.sample(pd.date_range('2013-02-01 09:00:00.000000',
                                                periods=1e8, freq='U'), 100000)).sort_index()

%%timeit
res2 = rolling_sum_cython(ts.index.astype(np.int64), ts.values.astype(np.double), long(1e6))
1000 loops, best of 3: 1.56 ms per loop
```

Recommended answer:
You can solve most problems of this sort with cumsum and binary search.
```python
from datetime import timedelta
import numpy as np

def msum(s, lag_in_ms):
    lag = s.index - timedelta(milliseconds=lag_in_ms)
    inds = np.searchsorted(s.index.astype(np.int64), lag.astype(np.int64))
    cs = s.cumsum()
    return pd.Series(cs.values - cs[inds].values + s[inds].values, index=s.index)

res = msum(ts, 100)
print pd.DataFrame({'a': ts, 'a_msum_100': res})

                            a  a_msum_100
2013-02-01 09:00:00.073479  5           5
2013-02-01 09:00:00.083717  8          13
2013-02-01 09:00:00.162707  1          14
2013-02-01 09:00:00.171809  6          20
2013-02-01 09:00:00.240111  7          14
2013-02-01 09:00:00.258455  0          14
2013-02-01 09:00:00.336564  2           9
2013-02-01 09:00:00.536416  3           3
2013-02-01 09:00:00.632439  4           7
2013-02-01 09:00:00.789746  9           9

[10 rows x 2 columns]
```
You need a way of handling NaNs, and depending on your application, you may or may not need the prevailing value as of the lagged time (i.e. the difference between using kdb+ bin and np.searchsorted).
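One cheap way to gain confidence in the cumsum-and-searchsorted trick, and in the window semantics you actually get, is to cross-check it against a brute-force O(n²) sum on a small series. A sketch; the `.iloc` positional indexing and `Index.searchsorted` substitutions are mine (so it runs on current pandas), not part of the answer above, and the data is invented:

```python
from datetime import timedelta
import pandas as pd

def msum_fast(s, lag_in_ms):
    # cumsum + binary search: sum over the closed window [t - lag, t]
    lag = s.index - timedelta(milliseconds=lag_in_ms)
    inds = s.index.searchsorted(lag)  # first position with time >= t - lag
    cs = s.cumsum()
    return pd.Series(cs.values - cs.iloc[inds].values + s.iloc[inds].values,
                     index=s.index)

def msum_slow(s, lag_in_ms):
    # brute-force reference with the same [t - lag, t] window
    lag = timedelta(milliseconds=lag_in_ms)
    return pd.Series([s[(s.index >= t - lag) & (s.index <= t)].sum() for t in s.index],
                     index=s.index)

s = pd.Series([1.0, 2.0, 3.0, 4.0],
              index=pd.to_datetime(['2013-02-01 09:00:00.000100',
                                    '2013-02-01 09:00:00.000500',
                                    '2013-02-01 09:00:00.001400',
                                    '2013-02-01 09:00:00.002600']))
assert msum_fast(s, 1).tolist() == msum_slow(s, 1).tolist() == [1.0, 3.0, 5.0, 4.0]
```

Note that searchsorted's default `side='left'` gives a window closed on both ends, slightly different from what a right-open kdb+-style lookup would give when a point sits exactly on the boundary.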
Hope this helps.