Get first non-null value per row
I have a sample dataframe, shown below. For each row, I want to check c1 first; if it is null, then check c2, and so on. In this way, I want to find the first non-null column and store its value in the column result.

    ID c1 c2 c3 c4 result
    1 a b a
    2 cc dd cc
    3 ee ff ee
    4 gg gg

I am using the approach below for now, but I would like to know whether there is a better method. (The column names do not follow any pattern; this is just a sample.)

    df["result"] = np.where(df["c1"].notnull(), df["c1"], None)
    df["result"] = np.where(df["result"].notnull(), df["result"], df["c2"])
    df["result"] = np.where(df["result"].notnull(), df["result"], df["c3"])
    df["result"] = np.where(df["result"].notnull(), df["result"], df["c4"])
    df["result"] = np.where(df["result"].notnull(), df["result"], "unknown")

When there are many columns, this approach becomes unwieldy.
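For reference, the chained np.where calls above amount to a loop over an ordered list of columns. The following is only a minimal sketch of that same idea, not a recommended solution; the sample data and the names `cols` and `fallback` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; None marks a missing cell.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "c1": ["a", None, None, None],
    "c2": ["b", "cc", "ee", None],
    "c3": [None, "dd", "ff", "gg"],
    "c4": [None, None, None, None],
})

cols = ["c1", "c2", "c3", "c4"]  # priority order; names need no pattern
fallback = "unknown"             # value used when every column is null

# Start from an all-null result and fill it from each column in turn,
# keeping any value that has already been found.
result = np.full(len(df), None, dtype=object)
for col in cols:
    result = np.where(pd.notnull(result), result, df[col])

# Rows that are still null after the last column get the fallback.
df["result"] = np.where(pd.notnull(result), result, fallback)
```

The loop removes the repetition, but it still makes one pass over the data per column.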
Best answer
Back-fill the NaNs along each row first, then select the first column with iloc:

    df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')

Or:

    df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')

    print (df)
       ID   c1   c2  c3   c4 result
    0   1    a    b   a  NaN      a
    1   2  NaN   cc  dd   cc     cc
    2   3  NaN   ee  ff   ee     ee
    3   4  NaN  NaN  gg   gg     gg
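As a self-contained illustration of the back-fill approach, the sketch below wraps it in a small helper. The function name `first_non_null`, its parameters, and the sample data (including the extra all-null row) are assumptions made for this example, not part of the original answer.

```python
import numpy as np
import pandas as pd


def first_non_null(frame, columns, default="unknown"):
    """Return, per row, the first non-null value among `columns` (in order)."""
    # bfill(axis=1) pulls each row's later values leftward into null cells,
    # so the first column then holds the first non-null value of the row.
    filled = frame[columns].bfill(axis=1)
    # Rows with no value in any column are still null; give them the default.
    return filled.iloc[:, 0].fillna(default)


# Hypothetical sample data; the last row is entirely null.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "c1": ["a", np.nan, np.nan, np.nan, np.nan],
    "c2": ["b", "cc", "ee", np.nan, np.nan],
    "c3": ["a", "dd", "ff", "gg", np.nan],
    "c4": [np.nan, "cc", "ee", "gg", np.nan],
})

df["result"] = first_non_null(df, ["c1", "c2", "c3", "c4"])
print(df)
```

Passing an explicit column list also avoids a pitfall of the positional `df.iloc[:, 1:]` form: if the frame already contains a `result` column, the positional slice would include it in the back-fill.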
Performance:

    df = pd.concat([df] * 1000, ignore_index=True)

    In [220]: %timeit df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')
    100 loops, best of 3: 2.78 ms per loop

    In [221]: %timeit df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')
    100 loops, best of 3: 2.7 ms per loop

    #jpp solution
    In [222]: %%timeit
         ...: cols = df.iloc[:, 1:].T.apply(pd.Series.first_valid_index)
         ...:
         ...: df['result'] = [df.loc[i, cols[i]] for i in range(len(df.index))]
         ...:
    1 loop, best of 3: 180 ms per loop

    #cᴏʟᴅsᴘᴇᴇᴅ's solution
    In [223]: %timeit df['result'] = df.stack().groupby(level=0).first()
    1 loop, best of 3: 606 ms per loop
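The IPython timings above can be approximated in a plain script with the standard `timeit` module. This is a rough sketch under assumed data: the frame contents and the helper names `bfill_first` and `stack_first` are illustrative, and the stack variant is restricted to the value columns here. Absolute numbers depend on hardware and pandas version, so none are quoted.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample frame, repeated 1000 times as in the timings above.
base = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "c1": ["a", np.nan, np.nan, np.nan],
    "c2": ["b", "cc", "ee", np.nan],
    "c3": ["a", "dd", "ff", "gg"],
    "c4": [np.nan, "cc", "ee", "gg"],
})
df = pd.concat([base] * 1000, ignore_index=True)
cols = ["c1", "c2", "c3", "c4"]


def bfill_first(frame):
    # Back-fill along rows, then take the first column.
    return frame[cols].bfill(axis=1).iloc[:, 0]


def stack_first(frame):
    # Stack to long form, then take the first non-null value per original row.
    return frame[cols].stack().groupby(level=0).first()


for func in (bfill_first, stack_first):
    # Best of 3 repeats, 10 calls each, reported as time per call.
    seconds = min(timeit.repeat(lambda: func(df), number=10, repeat=3)) / 10
    print(f"{func.__name__}: {seconds * 1e3:.2f} ms per call")
```

Taking the minimum over several repeats mirrors the "best of 3" figure that `%timeit` reports.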