How can two pandas DataFrames with the same columns be merged into a third DataFrame that shows the differences in values?

DataFrame df1:

     name  age  id  salary
0   Smith   30   2    2000
1     Ron   24   3   30000
2    Mike   35   4   40000
3    Jack   21   5    5000
4  Roshan   20   6   60000
5   Steve   45   8    8000
6   Peter   32   1    1000

DataFrame df2:

    name  age  salary  id
0  Peter   28   10000   1
1  Smith   30    1500   2
2    Ron   24    7000   3
3   Mike   35   20000   4
4   Jack   21    5000   5
5  Cathy   20    9000   6
6  Steve   45   56000   8

df1 and df2 are to be merged on id. Please note that the ids are the same in both df1 and df2, but their order is different. df3 needs to be created as shown below:

     name       age    id    salary
0   Smith        30     2    2000|1500
1     Ron        24     3    30000|7000
2    Mike        35     4    40000|20000
3    Jack        21     5    5000
4  Roshan|Cathy  20     6    60000|9000
5   Steve        45     8    8000|56000
6   Peter        32|28  1    1000|10000

I am planning to write the above output to an Excel sheet using to_excel. Before that, I want to add one more column to this DataFrame that says 'Match' or 'Mismatch'. The logic is: if a row shows at least one differing value, the result should be 'Mismatch', otherwise 'Match'. A mock-up of the output looks something like this:

   id    age          name      salary    Result
0   2     30         Smith   2000|1500  Mismatch
1   3     24           Ron  30000|7000  Mismatch
3   5     21          Jack        5000     Match
4   6     20  Roshan|Cathy  60000|9000  Mismatch
5   8     45         Steve  8000|56000  Mismatch
6   1  32|28         Peter  1000|10000  Mismatch

What can I use to achieve this result?

Use merge on id first, then use numpy.where to join the two values with '|' wherever they differ, and finally rebuild the frame from the id column plus the compared columns:

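import numpy as np
import pandas as pd

# Columns to compare: everything except the merge key
cols = df1.columns.difference(['id'])
# Inner merge on id; columns coming from df2 get a '_' suffix
df = df1.merge(df2, on='id', suffixes=('','_'))

# "left|right" strings for every compared cell
s = df[cols].astype(str) + '|' + df[cols + '_'].astype(str).values
# True where the df1 and df2 values differ
mask = df[cols].values != df[cols + '_'].values

# Keep the combined string only where the values differ
arr = np.where(mask, s, df[cols].astype(str))

# Reattach id; both frames share the merged frame's default index
df = df[['id']].join(pd.DataFrame(arr, columns=cols))
print(df)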
   id    age          name       salary
0   2     30         Smith    2000|1500
1   3     24           Ron   30000|7000
2   4     35          Mike  40000|20000
3   5     21          Jack         5000
4   6     20  Roshan|Cathy   60000|9000
5   8     45         Steve   8000|56000
6   1  32|28         Peter   1000|10000
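
The answer above stops short of the 'Match'/'Mismatch' column the question asks for. One possible way to add it, sketched here as an assumption rather than as part of the original answer, is to reuse the same boolean mask: a row is a 'Mismatch' if any compared cell differs. The names merged, right_cols and result, and the file name in the commented-out to_excel call, are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample frames copied from the question
df1 = pd.DataFrame({
    "name": ["Smith", "Ron", "Mike", "Jack", "Roshan", "Steve", "Peter"],
    "age": [30, 24, 35, 21, 20, 45, 32],
    "id": [2, 3, 4, 5, 6, 8, 1],
    "salary": [2000, 30000, 40000, 5000, 60000, 8000, 1000],
})
df2 = pd.DataFrame({
    "name": ["Peter", "Smith", "Ron", "Mike", "Jack", "Cathy", "Steve"],
    "age": [28, 30, 24, 35, 21, 20, 45],
    "salary": [10000, 1500, 7000, 20000, 5000, 9000, 56000],
    "id": [1, 2, 3, 4, 5, 6, 8],
})

# Columns to compare (everything except the key) and their df2 counterparts
cols = df1.columns.difference(["id"])
right_cols = [c + "_" for c in cols]
merged = df1.merge(df2, on="id", suffixes=("", "_"))

# True where the df1 value differs from the df2 value
mask = merged[cols].values != merged[right_cols].values

# "left|right" where values differ, otherwise the plain left value
combined = merged[cols].astype(str) + "|" + merged[right_cols].astype(str).values
arr = np.where(mask, combined, merged[cols].astype(str))

result = merged[["id"]].join(pd.DataFrame(arr, columns=cols, index=merged.index))

# A row is a mismatch if at least one compared column differs
result["Result"] = np.where(mask.any(axis=1), "Mismatch", "Match")
print(result)

# Hypothetical file name; writing .xlsx needs an Excel engine such as openpyxl
# result.to_excel("diff.xlsx", index=False)
```

Because the merge is an inner join, ids present in only one frame would be dropped silently; with the sample data every id appears in both frames, so all seven rows survive.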

