What I have so far is the code below. It works and gives the results it should: it fills df['c'] with the calculation previous c * b wherever no c is given. The problem is that I have to apply this to a bigger data set, len(df.index) = ca. 10.000, so the function I have so far is unsuitable, because I would have to write df['c'] = df.apply(func, axis=1) a couple of thousand times. A while loop is not an option in pandas for a data set of this size. Any ideas?

```python
import pandas as pd
import numpy as np

rng = pd.date_range('1/1/2011', periods=10, freq='D')
df = pd.DataFrame({'a': [None] * 10,
                   'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]}, index=rng)
df['c'] = np.nan
# Seed the two known values of c (label-based assignment avoids chained indexing).
df.loc[rng[0], 'c'] = 1
df.loc[rng[2], 'c'] = 3

def func(x):
    # Keep c where it is given; otherwise use the previous row's c times this row's b.
    if pd.notnull(x['c']):
        return x['c']
    else:
        return df.iloc[df.index.get_loc(x.name) - 1]['c'] * x['b']

# Each pass only fills rows whose predecessor is already known,
# so the call has to be repeated until no NaN is left.
df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
df['c'] = df.apply(func, axis=1)
```
Best answer
Here is a nice way of solving a recurrence problem. There will be docs on this in v0.16.2 (releasing next week). See the docs for numba.

This will be quite performant, as the real heavy lifting is done in fast jit-ted compiled code.

```python
import pandas as pd
import numpy as np
from numba import jit

rng = pd.date_range('1/1/2011', periods=10, freq='D')
df = pd.DataFrame({'a': np.nan,
                   'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]}, index=rng)
# .ix has been removed from pandas; .loc does the same label-based assignment.
df['c'] = np.nan
df.loc[rng[0], 'c'] = 1
df.loc[rng[2], 'c'] = 3

@jit(nopython=True)
def ffill(arr_b, arr_c):
    n = len(arr_b)
    assert len(arr_b) == len(arr_c)
    result = arr_c.copy()

    for i in range(1, n):
        if not np.isnan(arr_c[i]):
            # c is given for this row: keep it.
            result[i] = arr_c[i]
        else:
            # c is missing: previous result times this row's b.
            result[i] = result[i - 1] * arr_b[i]

    return result

df['d'] = ffill(df.b.values, df.c.values)
```

```
              a   b    c      d
2011-01-01  NaN   2    1      1
2011-01-02  NaN   3  NaN      3
2011-01-03  NaN  10    3      3
2011-01-04  NaN   3  NaN      9
2011-01-05  NaN   5  NaN     45
2011-01-06  NaN   8  NaN    360
2011-01-07  NaN   4  NaN   1440
2011-01-08  NaN   1  NaN   1440
2011-01-09  NaN   2  NaN   2880
2011-01-10  NaN   6  NaN  17280
```
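If numba is not available, the same recurrence can probably be expressed with pandas alone. This is not part of the answer above, only a sketch under one assumption: between two given values of c the result is just the last given c multiplied by the running product of b, so a grouped cumulative product reproduces the loop. The helper name `fill_recurrence` is hypothetical.

```python
import numpy as np
import pandas as pd

def fill_recurrence(b: pd.Series, c: pd.Series) -> pd.Series:
    """Fill missing c with (previous result) * b, without an explicit loop.

    Hypothetical helper, not from the original answer.
    """
    known = c.notna()
    # Every given c starts a new segment; rows before the first given c get segment 0.
    segment = known.cumsum()
    # Rows with a given c contribute a factor of 1, all other rows contribute their b.
    factor = b.astype(float).where(~known, 1.0)
    # Last given c, times the running product of the factors inside each segment.
    return c.ffill() * factor.groupby(segment).cumprod()

rng = pd.date_range('1/1/2011', periods=10, freq='D')
df = pd.DataFrame({'b': [2, 3, 10, 3, 5, 8, 4, 1, 2, 6]}, index=rng)
df['c'] = np.nan
df.loc[rng[0], 'c'] = 1
df.loc[rng[2], 'c'] = 3

df['d'] = fill_recurrence(df['b'], df['c'])
print(df['d'].tolist())
# [1.0, 3.0, 3.0, 9.0, 45.0, 360.0, 1440.0, 1440.0, 2880.0, 17280.0]
```

Rows that come before the first given c stay NaN, which matches what the loop version produces. With around 10,000 rows a long run of missing c can overflow a float in either approach, so it may be worth checking the result for inf.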