
Python - compare two columns in dataframe

I have two files with minor differences between the two. I want to output the values that are different so that I can see what changed. There are a lot of columns to compare.

Here's sample data (the only difference in this example is the Status on the first row):

Data1

ID      PROGRAM_CODE    Status
123     888             Active
123     777             Active
345     777             Inactive
345     999             Active
678     666             Inactive
901     777             Inactive
901     888             Active

Data2

ID      PROGRAM_CODE    Status
123     888             BLAH
123     777             Active
345     777             Inactive
345     999             Active
678     666             Inactive
901     777             Inactive
901     888             Active

Desired Output:

ID      Status_1    Status_2
123     Active      BLAH

My current approach is to create a list of columns, merge the two dataframes, and then use the list of columns in a for loop to compare. I believe my code is comparing entire series and printing the whole series if there is any difference at all. I just want to see the one row with different values. Also, this doesn't work if one field has a value and the corresponding field in the other dataframe is blank.

Code:

import pandas as pd

df1 = pd.read_excel(r"P:\data_files\data1.xlsx")
df2 = pd.read_excel(r"P:\data_files\data2.xlsx")


# create list of columns
l1 = list(df1)


# dropping the join values from the list because I don't want to compare those
l1 = [e for e in l1 if e not in ('ID','PROGRAM_CODE')]

# merge dataframes
df3 = df1.merge(df2, how='outer', on=['ID','PROGRAM_CODE'], suffixes=['_1', '_2'])

for x in l1:
    if df3[x+'_1'].any() != df3[x+'_2'].any():
        print(df3[['ID', x+'_1',x+'_2']])

Output of the above code: it shows all values for the Status column even though only the first row has values that differ between the dataframes.

ID      Status_1    Status_2
123     Active      BLAH
123     Active      Active
345     Inactive    Inactive
345     Active      Active
678     Inactive    Inactive
901     Inactive    Inactive
901     Active      Active
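The root cause is that `.any()` collapses each column to a single scalar, so the `if` compares two booleans instead of comparing rows. A minimal sketch of a row-wise fix, using inline data in place of the Excel files (the frame below is a hypothetical stand-in for the merged `df3`):

```python
import pandas as pd

# Hypothetical stand-in for the merged df3 from the sample data.
df3 = pd.DataFrame({
    'ID':       [123, 123, 345, 345, 678, 901, 901],
    'Status_1': ['Active', 'Active', 'Inactive', 'Active',
                 'Inactive', 'Inactive', 'Active'],
    'Status_2': ['BLAH', 'Active', 'Inactive', 'Active',
                 'Inactive', 'Inactive', 'Active'],
})

# Fill blanks first so a value-vs-NaN mismatch is caught cleanly
# (NaN != NaN is True, which would otherwise flag spurious rows).
cols = ['Status_1', 'Status_2']
df3[cols] = df3[cols].fillna('')

# Row-wise boolean mask: keep only the rows that actually differ.
mask = df3['Status_1'] != df3['Status_2']
diff = df3.loc[mask, ['ID', 'Status_1', 'Status_2']]
print(diff)
```

With the sample data this prints only the first row (ID 123, `Active` vs `BLAH`).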

Edit 12/12/17: The example from Wen below seems to work for one column, but I need to compare every row and column for two files where ID and Program_Code are the same.

I tried this loop:

for x in l1:
    print(df3.groupby('STUDENT_CID').x.apply(list).apply(pd.Series).add_prefix(x+'_'))

but I get the following error:

AttributeError: 'DataFrameGroupBy' object has no attribute 'x'

I need a way to loop through every column (both files contain the same columns).
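The error above comes from attribute access: `df3.groupby(...).x` looks for a literal column named `x`. To select a column whose name is held in a variable, use bracket indexing. A sketch under that assumption, with a small hypothetical frame in place of the merged data:

```python
import pandas as pd

# Hypothetical stand-in for the merged frame; the point is the
# indexing style, not the data.
df3 = pd.DataFrame({
    'STUDENT_CID': [123, 123],
    'Status': ['Active', 'BLAH'],
})

# Bracket indexing selects the column named by the loop variable.
for x in ['Status']:
    out = (df3.groupby('STUDENT_CID')[x]
              .apply(list)
              .apply(pd.Series)
              .add_prefix(x + '_'))
    print(out)
```

Each iteration yields one small frame per column, indexed by `STUDENT_CID`, with the values from both files spread into `Status_0`, `Status_1`, and so on.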

Additional Example:

Data File 1

ID      PROGRAM_CODE    I_CODE  INSTITUTION TERM    TYPE    STATUS      Hire_Date
123     888             111     ZBD         Fall    FINAL   Active      1/1/2017 0:00
123     777             111     ZBD         Fall    FINAL   Active      1/1/2017 0:00
345     777             125     GUB         Fall    FINAL   Inactive    2/3/2017 0:00
345     999             125     GUB         Fall    FINAL   Inactive    2/3/2017 0:00
678     999             111     ZBD         Fall    FINAL   Active      1/1/2017 0:00
678     888             111     ZBD         Fall    FINAL   Active      1/1/2017 0:00
901     888             654     YUI         Fall    FINAL   Inactive    5/1/2017 0:00
901     777             654     YUI         Fall    FINAL   Inactive    5/1/2017 0:00

Data File2

ID      PROGRAM_CODE    I_CODE  INSTITUTION TERM    TYPE    STATUS      Hire_Date
123     888             111     ZBD         Fall    FINAL   Inactive    1/1/2017 0:00
123     777             111     ZBD         Fall    FINAL   Active      1/1/2017 0:00
345     777             111     ZBD         Fall    FINAL   Inactive    2/3/2017 0:00
345     999             111     ZBD         Fall    FINAL   Inactive    2/3/2017 0:00
678     999             111     ZBD         Fall    FINAL   Active      1/1/2017 0:00
678     888             111     ZBD         Fall    FINAL   Active      1/1/2017 0:00
901     888             654     YUI         Fall    FINAL   Inactive    5/1/2017 0:00
901     777             654     YUI         Fall    FINAL   Inactive    5/1/2017 0:00

Desired Output

ID  STATUS_1        STATUS_2
123 Active          Inactive

ID  I_CODE_1        I_CODE_2
345 125             111

We can use pd.concat + drop_duplicates:

df1=pd.concat([df1,df2]).drop_duplicates(keep=False)
df1
Out[1085]:
    ID  PROGRAM_CODE  Status
0  123           888  Active
0  123           888    BLAH

Then we use groupby to create the table you need:

df1.groupby('ID').Status.apply(list).apply(pd.Series).add_prefix('Status_')
Out[1094]: 
    Status_0 Status_1
ID                   
123   Active     BLAH

Updated

df=pd.concat([df1,df2]).drop_duplicates(keep=False)
dd=df.groupby('ID').agg(lambda x:sorted(set(x), key=list(x).index)).stack()

dd[dd.apply(len)>1]
Out[1194]: 
ID               
123  STATUS          [Active, Inactive]
345  PROGRAM_CODE            [777, 999]
     I_CODE                  [125, 111]
     INSTITUTION             [GUB, ZBD]

I'm sure there are better ways to do it, but have you tried merging the dataframes (as you already are), creating a new column that compares Status_1 and Status_2, and then dropping any rows where the match is True? If you drop that "do they match" column afterwards, I believe you'll wind up with your desired output.
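A minimal sketch of that suggestion, using inline data in place of the Excel files (the two small frames below are hypothetical stand-ins for df1 and df2):

```python
import pandas as pd

# Hypothetical stand-ins for the two Excel files.
df1 = pd.DataFrame({'ID': [123, 123], 'PROGRAM_CODE': [888, 777],
                    'Status': ['Active', 'Active']})
df2 = pd.DataFrame({'ID': [123, 123], 'PROGRAM_CODE': [888, 777],
                    'Status': ['BLAH', 'Active']})

# Merge as in the question, then flag matching rows.
df3 = df1.merge(df2, on=['ID', 'PROGRAM_CODE'], suffixes=['_1', '_2'])
df3['match'] = df3['Status_1'] == df3['Status_2']

# Keep only the mismatches and drop the helper column.
result = df3[~df3['match']].drop(columns='match')
print(result)
```

On the sample data this leaves a single row: ID 123 with `Active` in one file and `BLAH` in the other.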
