
How to compare two dataframes and return a column with difference?

I am preparing a dataframe to store the change in employee skills.

I want to compare two tables with these columns: 'Employee Name', 'Skill Name', 'Year' and 'Score'. Some employees were hired, and some skills were added, in the second year. I want to check whether an employee or skill is missing from either dataframe and fill the gaps so that both dataframes have the same shape.

dataset = dataset[['Employee Name', 'Skill Name', 'Year', 'Score']]

min_y = dataset['Year'].min()
max_y = dataset['Year'].max()

ds1 = ds1.sort_values(['Employee Name', 'Skill Name'], ascending=[True, False])
ds2 = ds2.sort_values(['Employee Name', 'Skill Name'], ascending=[True, False])

ds1 = dataset[dataset['Year']==min_y].reset_index().drop(['index'], axis=1).drop(['Year'], axis=1)
ds2 = dataset[dataset['Year']==max_y].reset_index().drop(['index'], axis=1).drop(['Year'], axis=1)

dsBool = (ds1 != ds2).stack()
dsdiff = pd.concat([ds1.stack()[dsBool], ds2.stack()[dsBool]], axis=1)
dsdiff.columns=["Old", "New"]

Currently, comparing these two tables raises an error because the two DataFrames differ in shape: ValueError: Can only compare identically-labeled DataFrame objects
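For reference, the error can be reproduced with two tiny frames of different length. This is only a sketch: the names and scores below are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical data: the second year contains one extra employee.
ds1 = pd.DataFrame({"Employee Name": ["Ann"], "Score": [3]})
ds2 = pd.DataFrame({"Employee Name": ["Ann", "Bob"], "Score": [4, 2]})

try:
    # Element-wise comparison requires identical index and column labels.
    ds1 != ds2
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

Because the row labels differ (one row versus two), pandas refuses to compare the frames rather than guessing how to align them.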

Try making sure that both dataframes are indexed the same before comparison:

ds1 = dataset[dataset['Year']==min_y].drop(['Year'], axis=1).reset_index(drop=True)
ds2 = dataset[dataset['Year']==max_y].drop(['Year'], axis=1).reset_index(drop=True)

Then perform your comparison:

dsBool = (ds1 != ds2).stack()

Edit:

Actually, I think your original post may have the code in the wrong order. Try the following:

dataset = dataset[['Employee Name', 'Skill Name', 'Year', 'Score']]

dataset = dataset.sort_values(['Employee Name', 'Skill Name'], ascending=[True, False])

ds1 = dataset[dataset['Year'] == dataset['Year'].min()].drop(['Year'], axis=1).reset_index(drop=True)
ds2 = dataset[dataset['Year'] == dataset['Year'].max()].drop(['Year'], axis=1).reset_index(drop=True)

dsBool = (ds1 != ds2).stack()
dsdiff = pd.concat([ds1.stack()[dsBool], ds2.stack()[dsBool]], axis=1)
dsdiff.columns=["Old", "New"]
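If the two years contain different employees or skills, a positional comparison will still fail or compare unrelated rows. One possible way around that is to index both years by ('Employee Name', 'Skill Name') and reindex them onto the union of those pairs, so that gaps become NaN. The sketch below uses made-up sample data and assumes each pair occurs at most once per year; the names keys, old, new, all_pairs and changed are illustrative, not from the original post.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's dataset.
dataset = pd.DataFrame({
    "Employee Name": ["Ann", "Ann", "Bob", "Ann", "Ann", "Bob", "Cy"],
    "Skill Name": ["SQL", "Python", "SQL", "SQL", "Python", "Excel", "SQL"],
    "Year": [2019, 2019, 2019, 2020, 2020, 2020, 2020],
    "Score": [3, 2, 4, 4, 2, 5, 1],
})

keys = ["Employee Name", "Skill Name"]
min_y, max_y = dataset["Year"].min(), dataset["Year"].max()

# One Series of scores per year, labelled by (employee, skill).
# Assumption: each (employee, skill) pair is unique within a year.
old = dataset[dataset["Year"] == min_y].set_index(keys)["Score"]
new = dataset[dataset["Year"] == max_y].set_index(keys)["Score"]

# Reindex both years onto every pair seen in either year; gaps become NaN.
all_pairs = old.index.union(new.index)
dsdiff = pd.DataFrame({
    "Old": old.reindex(all_pairs),
    "New": new.reindex(all_pairs),
})
dsdiff["Difference"] = dsdiff["New"] - dsdiff["Old"]

# Rows whose score changed or that exist in only one year.
changed = dsdiff[dsdiff["Old"].ne(dsdiff["New"])]
print(changed)
```

After reindexing, both columns share the same labels, so the comparison no longer depends on row order or on the frames having the same length.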

As I understand it, the shape error comes from new employees being added and existing employees gaining new skills. To find the missing entries, you can concatenate the two data frames and then drop every employee/skill pair that appears in both. The only entries left are the ones present in just one of the data frames.

keys = ['Employee Name', 'Skill Name']
temp = pd.concat((ds1, ds2), axis=0)
# keep=False drops every copy of a repeated pair, so only pairs found in a single year remain
temp = temp.drop_duplicates(subset=keys, keep=False)

Note that drop_duplicates returns None when inplace=True, so the result should not be assigned back in that case. The temp data frame now holds the employee/skill pairs that appear in only one of the two original data frames. You can add the missing rows to the other data frame, after which the two shapes will match.
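A related way to see which year each missing entry belongs to is an outer merge with indicator=True, which adds a _merge column marking rows as left_only, right_only or both. The frames below are hypothetical sample data, and treating ('Employee Name', 'Skill Name') as the key is an assumption about the asker's table.

```python
import pandas as pd

# Hypothetical per-year frames with the 'Year' column already dropped.
ds1 = pd.DataFrame({
    "Employee Name": ["Ann", "Ann", "Bob"],
    "Skill Name": ["SQL", "Python", "SQL"],
    "Score": [3, 2, 4],
})
ds2 = pd.DataFrame({
    "Employee Name": ["Ann", "Ann", "Bob", "Cy"],
    "Skill Name": ["SQL", "Python", "Excel", "SQL"],
    "Score": [4, 2, 5, 1],
})

keys = ["Employee Name", "Skill Name"]

# The outer merge keeps every pair from both years; '_merge' records its origin.
merged = ds1.merge(ds2, on=keys, how="outer",
                   suffixes=("_old", "_new"), indicator=True)

# Pairs that exist in only one of the two years.
missing = merged[merged["_merge"] != "both"]
only_old = merged.loc[merged["_merge"] == "left_only", keys]
only_new = merged.loc[merged["_merge"] == "right_only", keys]
print(missing)
```

Here only_old lists pairs that disappeared in the second year and only_new lists pairs that were added, which tells you which rows each frame would need in order to match the other.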
