在多个Pandas列上连接2个Dataframe(Joining 2 Dataframes on multiple columns Pandas)

编程入门 行业动态 更新时间:2024-10-24 16:29:35
多个Pandas列上连接2个Dataframe(Joining 2 Dataframes on multiple columns Pandas)

考虑2个数据帧,需要使用2个唯一列(idA,idB)连接2个数据帧并计算其col距离的总和。 顺便说一下(idA,idB)等于(idB,idA),所以它们的距离必须相加

In [1]: df1 = pd.DataFrame({'idA': ['1', '2', '3', '2'], ...: 'idB': ['1', '4', '8', '1'], ...: 'Distance': ['0.727273', '0.827273', '0.127273', '0.927273']}, ...: index=[0, 1, 2, 3]) ...: In [2]: df2 = pd.DataFrame({'idA': ['1', '5', '2', '5'], ...: 'idB': ['2', '1', '4', '7'], ...: 'Distance': ['0.11', '0.1', '3.0', '0.8']}, ...: index=[4, 5, 6, 7])

输出必须是这样的:

Sum_Distance idA idB 0 0.727273 1 1 1 3.827273 2 4 <-- 2,4 = 3.0 + 2,4 = 0.827273 2 0.127273 3 8 3 1.037273 2 1 <-- 2,1 = 0.927273 + 1,2 = 0.11 4 0.1 5 1 5 0.8 5 7

使用Pandas / Spark帮助找到如何使用它的方法。

Consider 2 Dataframes and need to use joining of 2 dataframes by 2 unique columns (idA, idB) and compute sum of their col Distance . By the way (idA,idB) is equal to (idB,idA), so their Distance has to be summed

In [1]: df1 = pd.DataFrame({'idA': ['1', '2', '3', '2'], ...: 'idB': ['1', '4', '8', '1'], ...: 'Distance': ['0.727273', '0.827273', '0.127273', '0.927273']}, ...: index=[0, 1, 2, 3]) ...: In [2]: df2 = pd.DataFrame({'idA': ['1', '5', '2', '5'], ...: 'idB': ['2', '1', '4', '7'], ...: 'Distance': ['0.11', '0.1', '3.0', '0.8']}, ...: index=[4, 5, 6, 7])

The output has to be this way:

Sum_Distance idA idB 0 0.727273 1 1 1 3.827273 2 4 <-- 2,4 = 3.0 + 2,4 = 0.827273 2 0.127273 3 8 3 1.037273 2 1 <-- 2,1 = 0.927273 + 1,2 = 0.11 4 0.1 5 1 5 0.8 5 7

Help find the way how to do it using Pandas/Spark.

最满意答案

首先转换为数字两列,然后使用add with set_index对齐并按行对每对列进行排序:

df1['Distance'] = df1['Distance'].astype(float) df2['Distance'] = df2['Distance'].astype(float) #if some data are not parseable convert them to NaNs #df1['Distance'] = pd.to_numeric(df1['Distance'], errors='coerce') #df2['Distance'] = pd.to_numeric(df2['Distance'], errors='coerce') df1[['idA','idB']] = np.sort(df1[['idA','idB']], axis=1) df2[['idA','idB']] = np.sort(df2[['idA','idB']], axis=1) print (df1) Distance idA idB 0 0.727273 1 1 1 0.827273 2 4 2 0.127273 3 8 3 0.927273 1 2 print (df2) Distance idA idB 4 0.11 1 2 5 0.10 1 5 6 3.00 2 4 7 0.80 5 7
df3=df1.set_index(['idA','idB']).add(df2.set_index(['idA','idB']),fill_value=0).reset_index() print (df3) idA idB Distance 0 1 1 0.727273 1 1 2 1.037273 2 1 5 0.100000 3 2 4 3.827273 4 3 8 0.127273 5 5 7 0.800000

concat和groupby与sum另一个解决方案:

df3 = pd.concat([df1, df2]).groupby(['idA','idB'], as_index=False)['Distance'].sum() print (df3) idA idB Distance 0 1 1 0.727273 1 1 2 1.037273 2 1 5 0.100000 3 2 4 3.827273 4 3 8 0.127273 5 5 7 0.800000

First convert to numeric both columns and then use add with set_index for align and sort each pair of columns per rows:

df1['Distance'] = df1['Distance'].astype(float) df2['Distance'] = df2['Distance'].astype(float) #if some data are not parseable convert them to NaNs #df1['Distance'] = pd.to_numeric(df1['Distance'], errors='coerce') #df2['Distance'] = pd.to_numeric(df2['Distance'], errors='coerce') df1[['idA','idB']] = np.sort(df1[['idA','idB']], axis=1) df2[['idA','idB']] = np.sort(df2[['idA','idB']], axis=1) print (df1) Distance idA idB 0 0.727273 1 1 1 0.827273 2 4 2 0.127273 3 8 3 0.927273 1 2 print (df2) Distance idA idB 4 0.11 1 2 5 0.10 1 5 6 3.00 2 4 7 0.80 5 7
df3=df1.set_index(['idA','idB']).add(df2.set_index(['idA','idB']),fill_value=0).reset_index() print (df3) idA idB Distance 0 1 1 0.727273 1 1 2 1.037273 2 1 5 0.100000 3 2 4 3.827273 4 3 8 0.127273 5 5 7 0.800000

Another solution with concat and groupby with aggregate sum:

df3 = pd.concat([df1, df2]).groupby(['idA','idB'], as_index=False)['Distance'].sum() print (df3) idA idB Distance 0 1 1 0.727273 1 1 2 1.037273 2 1 5 0.100000 3 2 4 3.827273 4 3 8 0.127273 5 5 7 0.800000

更多推荐

本文发布于:2023-08-02 19:22:00,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/1379993.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:多个   Dataframe   Pandas   Joining   multiple

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!