
Compare two columns of two different dataframes

Recently, I switched from MATLAB to Python with pandas. It has been working great, but I am stuck on solving the following problem efficiently. For my analysis, I have two dataframes that look somewhat like this:

dfA =
     NUM      In        Date
0   2345    we 1    01/03/16
1   3631    we 1    23/02/16
2   2564    we 1    12/02/16
3   8785    sz 2    01/03/16
4   4767    dt 6    01/03/16
5   3452    dt 7    23/02/16
6   2134    sz 2    01/03/16
7   3465    sz 2    01/03/16

and

dfB
    In   Count_Num
0   we 1         3
1   sz 2         2
2   dt 6         3
3   dt 7         1

What I would like to perform is an operation that counts the 'NUM' entries for each "In" value in dfA and compares that count with "Count_Num" in dfB. Afterwards, I would like to add a column to dfB that records whether the comparison is True or False. In the example above, the operation should return this:

dfB
    In   Count_Num   Check
0   we 1         3   True
1   sz 2         2   False
2   dt 6         3   False
3   dt 7         1   True

My approach:

With value_counts() and pd.DataFrame, I constructed the following dfC from dfA:

dfC =

   In_Number       In_Total
0       we 1              3
1       sz 2              3
2       dt 6              1
3       dt 7              1

Then I merged dfC with dfB and compared the two count columns to check whether the values match. With this approach, I end up having to drop the extra columns afterwards. Is there a better/faster way to do this? I suspect one of pandas' built-in functions can do this efficiently. I looked into lookup and map, but I could not make them work.

Thanks for the help!

You can merge dfB with the per-group counts from dfA (groupby on column In, followed by count), add a new column check that compares the two merged columns, and finally drop the column NUM:

import pandas as pd

print(dfA)
    NUM    In      Date
0  2345  we 1  01/03/16
1  3631  we 1  23/02/16
2  2564  we 1  12/02/16
3  8785  sz 2  01/03/16
4  4767  dt 6  01/03/16
5  3452  dt 7  23/02/16
6  2134  sz 2  01/03/16
7  3465  sz 2  01/03/16

print(dfB)
     In  Count_Num
0  we 1          3
1  sz 2          2
2  dt 6          3
3  dt 7          1

# Count the rows of dfA for each value of In
print(dfA.groupby('In', as_index=False)['NUM'].count())
     In  NUM
0  dt 6    1
1  dt 7    1
2  sz 2    3
3  we 1    3

# Merge the per-group row counts from dfA into dfB on the shared key column In
df = pd.merge(dfB, dfA.groupby('In', as_index=False)['NUM'].count(), on=['In'])
print(df)
     In  Count_Num  NUM
0  we 1          3    3
1  sz 2          2    3
2  dt 6          3    1
3  dt 7          1    1

# Flag the rows where the counted rows match Count_Num, then drop the helper column
df['check'] = df['NUM'] == df['Count_Num']
df = df.drop('NUM', axis=1)
print(df)
     In  Count_Num  check
0  we 1          3   True
1  sz 2          2  False
2  dt 6          3  False
3  dt 7          1   True
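
For a version that runs end to end, here is a minimal sketch that rebuilds the two example frames from the printed tables above and applies the same groupby/merge steps. Two things in it are my own assumptions rather than part of the steps above: the way the frames are constructed, and the use of how='left' with fillna(0), which keeps an In value that appears in dfB but not in dfA and compares it against a count of 0 (the default inner merge would silently drop such a row).

```python
import pandas as pd

# Example frames reconstructed from the printed tables (construction is an assumption)
dfA = pd.DataFrame({
    'NUM': [2345, 3631, 2564, 8785, 4767, 3452, 2134, 3465],
    'In': ['we 1', 'we 1', 'we 1', 'sz 2', 'dt 6', 'dt 7', 'sz 2', 'sz 2'],
    'Date': ['01/03/16', '23/02/16', '12/02/16', '01/03/16',
             '01/03/16', '23/02/16', '01/03/16', '01/03/16'],
})
dfB = pd.DataFrame({
    'In': ['we 1', 'sz 2', 'dt 6', 'dt 7'],
    'Count_Num': [3, 2, 3, 1],
})

# Number of rows in dfA for each value of In
counts = dfA.groupby('In', as_index=False)['NUM'].count()

# Left merge keeps every row of dfB; keys missing from dfA get NaN, filled with 0
merged = pd.merge(dfB, counts, on='In', how='left')
merged['NUM'] = merged['NUM'].fillna(0).astype(int)

# Compare the counted rows with the expected count, then drop the helper column
merged['check'] = merged['NUM'] == merged['Count_Num']
result = merged.drop('NUM', axis=1)
print(result)
```

With the example data every In value of dfB occurs in dfA, so the left merge gives the same rows as the inner merge shown above.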

Alternatively, you can overwrite NUM with the comparison result and rename it to Check, which avoids the drop:

# Same merge as before: per-group row counts from dfA joined onto dfB
df = pd.merge(dfB, dfA.groupby('In', as_index=False)['NUM'].count(), on=['In'])
print(df)
     In  Count_Num  NUM
0  we 1          3    3
1  sz 2          2    3
2  dt 6          3    1
3  dt 7          1    1

# Overwrite NUM with the boolean comparison, then rename the column to Check
df['NUM'] = df['NUM'] == df['Count_Num']
df = df.rename(columns={'NUM': 'Check'})
print(df)
     In  Count_Num  Check
0  we 1          3   True
1  sz 2          2  False
2  dt 6          3  False
3  dt 7          1   True
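
Since the question mentions map: the same check can probably be done without a merge at all. The sketch below is not part of the approach above; it assumes the example frames can be rebuilt as shown (the construction is mine) and uses value_counts() to get a Series of row counts indexed by In, then Series.map to look up the count for each row of dfB. The variable names in_counts and out are only illustrative.

```python
import pandas as pd

# Example frames reconstructed from the printed tables (construction is an assumption)
dfA = pd.DataFrame({
    'NUM': [2345, 3631, 2564, 8785, 4767, 3452, 2134, 3465],
    'In': ['we 1', 'we 1', 'we 1', 'sz 2', 'dt 6', 'dt 7', 'sz 2', 'sz 2'],
})
dfB = pd.DataFrame({
    'In': ['we 1', 'sz 2', 'dt 6', 'dt 7'],
    'Count_Num': [3, 2, 3, 1],
})

# Series of row counts in dfA, indexed by the In value
in_counts = dfA['In'].value_counts()

# Look up the count for each In in dfB; values absent from dfA map to NaN, treated as 0
out = dfB.copy()
out['Check'] = out['In'].map(in_counts).fillna(0).astype(int) == out['Count_Num']
print(out)
```

This leaves dfB's row order untouched and needs no helper column, so there is nothing to drop or rename afterwards.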
