Grouping By Using a Conditional Statement from a Different Column

I need to group a pandas DataFrame according to the following criteria; it is similar to an OHLC aggregation:

    open   = last where volume > 0 (in case there is no entry with volume > 0, use overall last)
    high   = max
    low    = min
    last   = last
    volume = max

My current implementation for these kinds of operations (ohlc aggregation) is:

    ohlc_dict = {
        'open': 'first',
        'high': 'max',
        'low': 'min',
        'close': 'last',
        'volume': 'sum',
    }
    df = df.groupby(pd.Grouper(freq='1Min', level=0, label='left')).agg(ohlc_dict)

How can I solve this? Thank you.

Sample:

                           fi  ts     open     high      low    close  volume
    datetime
    2017-11-17 12:35:00   0   0  0.96214  0.96214  0.96214  0.96214       0
    2017-11-17 12:35:00   0   0  0.96214  0.96214  0.96214  0.96214       0
    2017-11-17 12:35:00   0   0  0.96214  0.96220  0.96214  0.96220       0
    2017-11-17 12:35:00   0   0  0.96214  0.96220  0.96214  0.96220       0
    2017-11-17 12:35:00   0   0  0.96214  0.96220  0.96214  0.96220       0
    2017-11-17 12:35:00   0   0  0.96213  0.96220  0.96213  0.96219      19
    2017-11-17 12:35:00   0   0  0.96214  0.96220  0.96214  0.96219       0
    2017-11-17 12:35:00   0   0  0.96214  0.96222  0.96214  0.96222       0
    2017-11-17 12:35:00   0   0  0.96214  0.96222  0.96214  0.96220       0
    2017-11-17 12:35:00   0   0  0.96214  0.96222  0.96214  0.96221       0
    2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96223       0
    2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96221       0
    2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96220       0
    2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96220       0
    2017-11-17 12:35:00   0   0  0.96213  0.96223  0.96213  0.96220      29
    2017-11-17 12:35:00   0   0  0.96213  0.96223  0.96213  0.96220      29
    2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96221       0
    2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96222       0

Desired Output:

                           fi  ts     open     high      low    close  volume
    datetime
    2017-11-17 12:35:00   0   0  0.96213  0.96223  0.96213  0.96222      29

Additional Information:

There are two data sources which can be identified by their 'Volume' values:

a. Volume = 0 (more frequent, less reliable)
b. Volume > 0 (less frequent, more reliable)

As type 'b' is more reliable, it is preferable to use its open value over the type 'a' open value.

As for using last as the aggregation, it doesn't really matter: other aggregations (first, max, min) would work as well, since the open value is the first quoted value in a minute (in this example) and never changes.

Incorrect values arise when the connection to the server is interrupted. Type 'a' data cannot cope with this and may give me wrong values, whereas type 'b' data handles it fine and gives me correct values.

Best answer

You can first aggregate the open column with last:

    ohlc_dict = {
        'high': 'max',
        'low': 'min',
        'close': 'last',
        'open': 'last',
        'volume': 'sum'
    }
    g = df.groupby(pd.Grouper(freq='1Min', level=0, label='left'))
    df2 = g.agg(ohlc_dict)
    print(df2)

                             low    close     high     open  volume
    datetime
    2017-11-17 12:35:00  0.96213  0.96222  0.96223  0.96215      77

Then filter out all rows with zero volume and aggregate only the last value of open:

    g1 = df[df['volume'] > 0].groupby(pd.Grouper(freq='1Min', level=0, label='left'))
    df1 = g1['open'].last().reindex(df2.index)
    print(df1)

    datetime
    2017-11-17 12:35:00    0.96213
    Freq: T, Name: open, dtype: float64

Finally, combine both into one DataFrame with to_frame and combine_first:

    df3 = df1.to_frame().combine_first(df2)
    print(df3)

                           close     high      low     open  volume
    datetime
    2017-11-17 12:35:00  0.96222  0.96223  0.96213  0.96213    77.0
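For reference, the three steps can be put together in one runnable sketch. The data below is a hypothetical reduced sample (the seconds and the second minute are invented for illustration), and volume is aggregated with 'max' as the question's criteria state, whereas the answer's dictionary uses 'sum'. The second minute has no row with volume > 0, so it shows the fallback to the overall last open.

```python
import pandas as pd

# Hypothetical reduced sample: timestamps and the 12:36 rows are invented.
idx = pd.to_datetime([
    "2017-11-17 12:35:05", "2017-11-17 12:35:15", "2017-11-17 12:35:25",
    "2017-11-17 12:35:35", "2017-11-17 12:35:45",
    "2017-11-17 12:36:05", "2017-11-17 12:36:15",
])
df = pd.DataFrame(
    {
        "open":   [0.96214, 0.96213, 0.96214, 0.96213, 0.96214, 0.96230, 0.96231],
        "high":   [0.96214, 0.96220, 0.96222, 0.96223, 0.96223, 0.96230, 0.96231],
        "low":    [0.96214, 0.96213, 0.96214, 0.96213, 0.96214, 0.96230, 0.96231],
        "close":  [0.96214, 0.96219, 0.96222, 0.96220, 0.96222, 0.96230, 0.96231],
        "volume": [0, 19, 0, 29, 0, 0, 0],
    },
    index=idx,
)
df.index.name = "datetime"

grouper = pd.Grouper(freq="1min", level=0, label="left")

# Step 1: plain aggregation; 'open' is the overall last value for now.
ohlc_dict = {"open": "last", "high": "max", "low": "min",
             "close": "last", "volume": "max"}
df2 = df.groupby(grouper).agg(ohlc_dict)

# Step 2: last 'open' among rows with volume > 0, aligned to df2's index
# (minutes without such rows become NaN).
df1 = df[df["volume"] > 0].groupby(grouper)["open"].last().reindex(df2.index)

# Step 3: prefer df1's 'open', fall back to df2 where it is NaN,
# then restore the column order of ohlc_dict.
df3 = df1.to_frame().combine_first(df2)[list(ohlc_dict)]
print(df3)
```

Because combine_first only fills values that are missing in the calling frame, minutes that contain a traded row keep the filtered open, and all other minutes keep the value from the plain aggregation.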

Alternatively, use a custom function with the condition (slower):

    def ohlc_func(x):
        # last 'open' among rows with volume > 0, else the overall last 'open'
        traded = x.loc[x['volume'] > 0, 'open']
        a = traded.iloc[-1] if len(traded) > 0 else x['open'].iloc[-1]
        b = x['high'].max()
        c = x['low'].min()
        # use .iloc[-1] rather than [0], which relies on positional fallback
        d = x['close'].iloc[-1]
        e = x['volume'].sum()
        col = ['open', 'high', 'low', 'close', 'volume']
        return pd.Series([a, b, c, d, e], index=col)

    df = df.groupby(pd.Grouper(freq='1Min', level=0, label='left')).apply(ohlc_func)
    print(df)

                            open     high      low    close  volume
    datetime
    2017-11-17 12:35:00  0.96213  0.96223  0.96213  0.96222    77.0
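The custom function is slower because it runs Python code once per group. A possible vectorized variant, which is not part of the original answer, masks open on zero-volume rows and relies on the fact that GroupBy last skips NaN by default. The function name ohlc_prefer_traded, the helper column open_traded, and the sample data are hypothetical; volume is summed here, as in the answer.

```python
import pandas as pd

def ohlc_prefer_traded(df, freq="1min"):
    """Aggregate to OHLC bars, taking 'open' from rows with volume > 0 if any exist."""
    # Blank out 'open' on zero-volume rows; 'last' skips NaN, so the helper
    # column yields the last open among traded rows (NaN if there are none).
    tmp = df.assign(open_traded=df["open"].where(df["volume"] > 0))
    out = tmp.groupby(pd.Grouper(freq=freq, level=0, label="left")).agg(
        open_traded=("open_traded", "last"),
        open=("open", "last"),
        high=("high", "max"),
        low=("low", "min"),
        close=("close", "last"),
        volume=("volume", "sum"),
    )
    # Fall back to the overall last open where no traded row was present.
    out["open"] = out["open_traded"].fillna(out["open"])
    return out.drop(columns="open_traded")

# Hypothetical sample: one minute with a traded row, one minute without.
idx = pd.to_datetime([
    "2017-11-17 12:35:10", "2017-11-17 12:35:20",
    "2017-11-17 12:35:30", "2017-11-17 12:36:10",
])
ticks = pd.DataFrame(
    {
        "open":   [0.96213, 0.96214, 0.96214, 0.96230],
        "high":   [0.96220, 0.96223, 0.96223, 0.96230],
        "low":    [0.96213, 0.96214, 0.96214, 0.96230],
        "close":  [0.96219, 0.96221, 0.96222, 0.96230],
        "volume": [19, 0, 0, 0],
    },
    index=idx,
)
ticks.index.name = "datetime"

bars = ohlc_prefer_traded(ticks)
print(bars)
```

This keeps everything in a single groupby pass, at the cost of one temporary column.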
