I have two files with minor differences between the two. I want to output the values that are different so that I can see what changed. There are a lot of columns to compare.
Here's sample data (only difference in example is status on first row):
Data1
ID PROGRAM_CODE Status
123 888 Active
123 777 Active
345 777 Inactive
345 999 Active
678 666 Inactive
901 777 Inactive
901 888 Active
Data2
ID PROGRAM_CODE Status
123 888 BLAH
123 777 Active
345 777 Inactive
345 999 Active
678 666 Inactive
901 777 Inactive
901 888 Active
Desired Output:
ID Status_1 Status_2
123 Active BLAH
My current approach is to create a list of columns, merge the two dataframes, and then use the list of columns in a for loop to compare. I believe my code is comparing series and outputting the whole series if there is any difference at all. I just want to see the one row with different values. Also, this doesn't work if one field has a value and it is blank in the other dataframe.
Code:
import pandas as pd
df1 = pd.read_excel(r"P:\data_files\data1.xlsx")
df2 = pd.read_excel(r"P:\data_files\data2.xlsx")
# create list of columns
l1 = list(df1)
# dropping the join values from the list because I don't want to compare those
l1 = [e for e in l1 if e not in ('ID','PROGRAM_CODE')]
# merge dataframes
df3 = df1.merge(df2, how='outer', on=['ID','PROGRAM_CODE'], suffixes=['_1', '_2'])
for x in l1:
    if df3[x+'_1'].any() != df3[x+'_2'].any():
        print(df3[['ID', x+'_1',x+'_2']])
Output of above code: Shows all values for the Status column even though only the first row has values that are different between data frames.
ID Status_1 Status_2
123 Active BLAH
123 Active Active
345 Inactive Inactive
345 Active Active
678 Inactive Inactive
901 Inactive Inactive
901 Active Active
Edit 12/12/17: The example from Wen below seems to work for one column, but I need to compare every row and column across the two files wherever ID and PROGRAM_CODE are the same.
I tried this loop:
for x in l1:
    print(df3.groupby('STUDENT_CID').x.apply(list).apply(pd.Series).add_prefix(x+'_'))
but I get the following error:
AttributeError: 'DataFrameGroupBy' object has no attribute 'x'
I need a way to loop through every column (both files contain the same columns).
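A note on the error above: `.x` asks the groupby object for a column literally named "x", not the column whose name is stored in the variable `x`. Bracket indexing takes the variable instead. The following is a minimal sketch on made-up data (the frame `df`, the list `compare_cols` and the dict `tables` are illustrative names, not from the original files):

```python
import pandas as pd

# Toy stand-in for a frame that holds only the rows that differ (assumed data).
df = pd.DataFrame({
    "ID": [123, 123, 345, 345],
    "STATUS": ["Active", "Inactive", "Inactive", "Inactive"],
    "I_CODE": [125, 125, 125, 111],
})

compare_cols = ["STATUS", "I_CODE"]
tables = {}
for col in compare_cols:
    # groupby("ID")[col] selects the column named by the variable `col`;
    # groupby("ID").col would look for a column literally called "col".
    lists = df.groupby("ID")[col].apply(list)
    # Spread each list across columns, then prefix them with the column name.
    wide = pd.DataFrame(lists.tolist(), index=lists.index)
    tables[col] = wide.add_prefix(col + "_")
    print(tables[col])
```

This only removes the AttributeError; whether the resulting table is the comparison you want still depends on what the grouped frame contains.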
Additional Example:
Data File 1
ID PROGRAM_CODE I_CODE INSTITUTION TERM TYPE STATUS Hire_Date
123 888 111 ZBD Fall FINAL Active 1/1/2017 0:00
123 777 111 ZBD Fall FINAL Active 1/1/2017 0:00
345 777 125 GUB Fall FINAL Inactive 2/3/2017 0:00
345 999 125 GUB Fall FINAL Inactive 2/3/2017 0:00
678 999 111 ZBD Fall FINAL Active 1/1/2017 0:00
678 888 111 ZBD Fall FINAL Active 1/1/2017 0:00
901 888 654 YUI Fall FINAL Inactive 5/1/2017 0:00
901 777 654 YUI Fall FINAL Inactive 5/1/2017 0:00
Data File2
ID PROGRAM_CODE I_CODE INSTITUTION TERM TYPE STATUS Hire_Date
123 888 111 ZBD Fall FINAL Inactive 1/1/2017 0:00
123 777 111 ZBD Fall FINAL Active 1/1/2017 0:00
345 777 111 ZBD Fall FINAL Inactive 2/3/2017 0:00
345 999 111 ZBD Fall FINAL Inactive 2/3/2017 0:00
678 999 111 ZBD Fall FINAL Active 1/1/2017 0:00
678 888 111 ZBD Fall FINAL Active 1/1/2017 0:00
901 888 654 YUI Fall FINAL Inactive 5/1/2017 0:00
901 777 654 YUI Fall FINAL Inactive 5/1/2017 0:00
Desired Output
ID STATUS_1 STATUS_2
123 Active Inactive
ID I_CODE_1 I_CODE_2
345 125 111
We can use pd.concat + drop_duplicates:
df1=pd.concat([df1,df2]).drop_duplicates(keep=False)
df1
Out[1085]:
ID PROGRAM_CODE Status
0 123 888 Active
0 123 888 BLAH
Then we groupby to create the table you need:
df1.groupby('ID').Status.apply(list).apply(pd.Series).add_prefix('Status_')
Out[1094]:
Status_0 Status_1
ID
123 Active BLAH
Updated
df=pd.concat([df1,df2]).drop_duplicates(keep=False)
dd=df.groupby('ID').agg(lambda x:sorted(set(x), key=list(x).index)).stack()
dd[dd.apply(len)>1]
Out[1194]:
ID
123 STATUS [Active, Inactive]
345 PROGRAM_CODE [777, 999]
I_CODE [125, 111]
INSTITUTION [GUB, ZBD]
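Because the groupby above uses ID alone, PROGRAM_CODE shows up as a "difference" for ID 345 even though it is a join key. An alternative sketch, assuming rows should be matched on both ID and PROGRAM_CODE as in the question, merges the frames and checks each remaining column row by row. The sample values are shortened from the question's data, with one blank STATUS added to illustrate the missing-value case; `keys` and `diffs` are illustrative names:

```python
import pandas as pd

keys = ["ID", "PROGRAM_CODE"]
df1 = pd.DataFrame({
    "ID": [123, 123, 345],
    "PROGRAM_CODE": [888, 777, 777],
    "I_CODE": [111, 111, 125],
    "STATUS": ["Active", "Active", None],  # blank in file 1 (assumed case)
})
df2 = pd.DataFrame({
    "ID": [123, 123, 345],
    "PROGRAM_CODE": [888, 777, 777],
    "I_CODE": [111, 111, 111],
    "STATUS": ["Inactive", "Active", "Inactive"],
})

merged = df1.merge(df2, how="outer", on=keys, suffixes=("_1", "_2"))
diffs = {}
for col in [c for c in df1.columns if c not in keys]:
    left, right = merged[col + "_1"], merged[col + "_2"]
    # Treat a row as unchanged if the values are equal or both are missing.
    same = left.eq(right) | (left.isna() & right.isna())
    changed = ~same
    if changed.any():
        # Keep only the rows where this column differs.
        diffs[col] = merged.loc[changed, keys + [col + "_1", col + "_2"]]
        print(diffs[col], end="\n\n")
```

A value that is blank on one side only is reported as a difference, while a value that is blank on both sides is not.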
I'm sure there are better ways to do it, but have you tried merging the dataframes (as you already are), creating a new column that compares Status_1 and Status_2, and then dropping any rows where that comparison is True? If you then drop that "do they match" column, I believe you'll wind up with your desired output.
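A rough sketch of that idea, using a shortened copy of the question's sample data (the column name `match` is just an illustrative choice):

```python
import pandas as pd

keys = ["ID", "PROGRAM_CODE"]
df1 = pd.DataFrame({"ID": [123, 123, 345],
                    "PROGRAM_CODE": [888, 777, 777],
                    "Status": ["Active", "Active", "Inactive"]})
df2 = pd.DataFrame({"ID": [123, 123, 345],
                    "PROGRAM_CODE": [888, 777, 777],
                    "Status": ["BLAH", "Active", "Inactive"]})

df3 = df1.merge(df2, how="outer", on=keys, suffixes=("_1", "_2"))
# Element-wise comparison: one True/False per row, not one per column.
df3["match"] = df3["Status_1"] == df3["Status_2"]
# Keep the rows that do not match and leave the helper column out.
result = df3.loc[~df3["match"], ["ID", "Status_1", "Status_2"]]
print(result)
```

One caveat: a plain `==` treats two missing values as unequal, so rows that are blank in both files would also be listed unless they are handled separately.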