Joining 2 DataFrames on multiple columns in Pandas

I have two DataFrames that need to be joined on two key columns (idA, idB), summing their Distance column. Note that (idA, idB) is equivalent to (idB, idA), so the distances for both orderings of a pair have to be summed together.

In [1]: df1 = pd.DataFrame({'idA': ['1', '2', '3', '2'],
   ...:                     'idB': ['1', '4', '8', '1'],
   ...:                     'Distance': ['0.727273', '0.827273', '0.127273', '0.927273']},
   ...:                     index=[0, 1, 2, 3])
   ...: 

In [2]: df2 = pd.DataFrame({'idA': ['1', '5', '2', '5'],
   ...:                     'idB': ['2', '1', '4', '7'],
   ...:                     'Distance': ['0.11', '0.1', '3.0', '0.8']},
   ...:                      index=[4, 5, 6, 7])

The expected output is:

    Sum_Distance    idA idB
  0  0.727273       1   1
  1  3.827273       2   4  <-- 2,4 = 3.0 + 2,4 = 0.827273
  2  0.127273       3   8
  3  1.037273       2   1  <-- 2,1 = 0.927273 + 1,2 = 0.11
  4  0.1            5   1
  5  0.8            5   7

How can this be done using Pandas/Spark?

First convert the Distance column of both frames to numeric and sort each (idA, idB) pair within its row, so that both orderings of a pair share the same key. Then use add with set_index to align the two frames on (idA, idB):

import numpy as np
import pandas as pd

# Distance is stored as strings, so convert it to float first
df1['Distance'] = df1['Distance'].astype(float)
df2['Distance'] = df2['Distance'].astype(float)

# If some values cannot be parsed, convert them to NaN instead
#df1['Distance'] = pd.to_numeric(df1['Distance'], errors='coerce')
#df2['Distance'] = pd.to_numeric(df2['Distance'], errors='coerce')

# Sort the ids within each row so that (2, 1) and (1, 2) become the same pair
df1[['idA','idB']] = np.sort(df1[['idA','idB']], axis=1)
df2[['idA','idB']] = np.sort(df2[['idA','idB']], axis=1)

print (df1)
   Distance idA idB
0  0.727273   1   1
1  0.827273   2   4
2  0.127273   3   8
3  0.927273   1   2

print (df2)
   Distance idA idB
4      0.11   1   2
5      0.10   1   5
6      3.00   2   4
7      0.80   5   7   

df3=df1.set_index(['idA','idB']).add(df2.set_index(['idA','idB']),fill_value=0).reset_index()
print (df3)
  idA idB  Distance
0   1   1  0.727273
1   1   2  1.037273
2   1   5  0.100000
3   2   4  3.827273
4   3   8  0.127273
5   5   7  0.800000
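
The alignment done by add is easiest to reason about when each (idA, idB) pair occurs at most once per frame. As a minimal sketch, assuming a frame may contain the same pair more than once, each frame can be aggregated before the two are added. The sample data and the helper name pair_totals are made up for illustration and are not part of the question:

```python
import numpy as np
import pandas as pd


def pair_totals(df):
    """Return the total Distance per unordered (idA, idB) pair as a Series."""
    out = df.copy()
    out["Distance"] = pd.to_numeric(out["Distance"], errors="coerce")
    # Sort the two ids within each row so (2, 1) and (1, 2) share one key.
    out[["idA", "idB"]] = np.sort(out[["idA", "idB"]].to_numpy(), axis=1)
    # Aggregating here leaves a unique (idA, idB) index for the later add.
    return out.groupby(["idA", "idB"])["Distance"].sum()


# Hypothetical data: the pair (1, 2) appears twice in `left`, in both orders.
left = pd.DataFrame({"idA": ["2", "1", "3"],
                     "idB": ["1", "2", "8"],
                     "Distance": ["0.5", "0.25", "1.0"]})
right = pd.DataFrame({"idA": ["1", "5"],
                      "idB": ["2", "1"],
                      "Distance": ["0.25", "2.0"]})

# fill_value=0 keeps pairs that exist in only one of the two frames.
totals = pair_totals(left).add(pair_totals(right), fill_value=0).reset_index()
print(totals)
```

The helper works on a copy, so the original frames keep their id order.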

Another solution uses concat and groupby with an aggregate sum:

df3 = pd.concat([df1, df2]).groupby(['idA','idB'], as_index=False)['Distance'].sum()
print (df3)
  idA idB  Distance
0   1   1  0.727273
1   1   2  1.037273
2   1   5  0.100000
3   2   4  3.827273
4   3   8  0.127273
5   5   7  0.800000
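
For reference, the concat and groupby approach can be sketched end to end on the question's data. This is one possible arrangement rather than the only one: the function name sum_pair_distances is hypothetical, and the Sum_Distance column name is taken from the expected output in the question.

```python
import numpy as np
import pandas as pd


def sum_pair_distances(*frames):
    """Concatenate frames and sum Distance per unordered (idA, idB) pair."""
    # pd.concat builds a new frame, so the inputs are not modified.
    combined = pd.concat(frames, ignore_index=True)
    combined["Distance"] = pd.to_numeric(combined["Distance"], errors="coerce")
    # Put the two ids of each row in a fixed order so (2, 1) matches (1, 2).
    combined[["idA", "idB"]] = np.sort(combined[["idA", "idB"]].to_numpy(), axis=1)
    summed = combined.groupby(["idA", "idB"], as_index=False)["Distance"].sum()
    return summed.rename(columns={"Distance": "Sum_Distance"})


df1 = pd.DataFrame({"idA": ["1", "2", "3", "2"],
                    "idB": ["1", "4", "8", "1"],
                    "Distance": ["0.727273", "0.827273", "0.127273", "0.927273"]},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({"idA": ["1", "5", "2", "5"],
                    "idB": ["2", "1", "4", "7"],
                    "Distance": ["0.11", "0.1", "3.0", "0.8"]},
                   index=[4, 5, 6, 7])

result = sum_pair_distances(df1, df2)
print(result)
```

Because it accepts any number of frames, the same function also covers the case of more than two inputs.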

df1.Distance = pd.to_numeric(df1.Distance)
df2.Distance = pd.to_numeric(df2.Distance)
# Sum the two ids into a single key (the keys in the output below are integer sums)
df = pd.concat([df1.assign(key=df1.idA + df1.idB),
                df2.assign(key=df2.idA + df2.idB)]).\
    groupby('key').agg({'Distance': 'sum', 'idA': 'first', 'idB': 'first'})
df
Out[672]: 
     Distance  idA  idB
key                    
2    0.727273    1    1
3    1.037273    2    1
6    3.927273    2    4
11   0.127273    3    8
12   0.800000    5    7
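
Note that the row with key 6 above shows 3.927273 rather than the expected 3.827273, and there is no row for the pair (5, 1). A likely reading, assuming integer ids, is that idA + idB is not unique per pair: 2 + 4 and 5 + 1 both give 6. A small pure-Python sketch of that collision, compared with a sorted-tuple key (the variable names are illustrative only):

```python
from collections import defaultdict

# The three input rows whose ids sum to 6, written with integer ids
# (an assumption based on the integer keys in the output above).
rows = [
    (2, 4, 0.827273),  # from df1
    (2, 4, 3.0),       # from df2
    (5, 1, 0.1),       # from df2
]

by_sum = defaultdict(float)   # key = idA + idB
by_pair = defaultdict(float)  # key = (idA, idB) sorted into a tuple
for id_a, id_b, distance in rows:
    by_sum[id_a + id_b] += distance
    by_pair[tuple(sorted((id_a, id_b)))] += distance

print(dict(by_sum))   # a single key, 6, holding all three distances
print(dict(by_pair))  # two keys, (2, 4) and (1, 5), kept apart
```

A key built from both ids in a fixed order stays unique per unordered pair, which a plain sum cannot guarantee.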

Updated: sort the ids within each row, then group on both id columns:

df1[['idA','idB']]=np.sort(df1[['idA','idB']].values)
df2[['idA','idB']]=np.sort(df2[['idA','idB']].values)

pd.concat([df1,df2]).groupby(['idA','idB'],as_index=False).Distance.sum()
Out[678]: 
   idA  idB  Distance
0    1    1  0.727273
1    1    2  1.037273
2    1    5  0.100000
3    2    4  3.827273
4    3    8  0.127273
5    5    7  0.800000
