I am preparing a dataframe to store the change in employee skills.
I want to compare two tables with these columns: 'Employee Name', 'Skill Name', 'Year' and 'Score'. Some employees were hired and some skills were added in the second year. I want to check whether an employee or skill is missing from either dataframe and fill the gaps so that both dataframes have the same shape.
dataset = dataset[['Employee Name', 'Skill Name', 'Year', 'Score']]
min_y = dataset['Year'].min()
max_y = dataset['Year'].max()
ds1 = ds1.sort_values(['Employee Name', 'Skill Name'], ascending=[True, False])
ds2 = ds2.sort_values(['Employee Name', 'Skill Name'], ascending=[True, False])
ds1 = dataset[dataset['Year']==min_y].reset_index().drop(['index'], axis=1).drop(['Year'], axis=1)
ds2 = dataset[dataset['Year']==max_y].reset_index().drop(['index'], axis=1).drop(['Year'], axis=1)
dsBool = (ds1 != ds2).stack()
dsdiff = pd.concat([ds1.stack()[dsBool], ds2.stack()[dsBool]], axis=1)
dsdiff.columns=["Old", "New"]
Currently, comparing these two tables raises an error because the two DataFrames differ in shape: ValueError: Can only compare identically-labeled DataFrame objects
Try making sure that both dataframes are indexed the same before comparison:
ds1 = dataset[dataset['Year']==min_y].drop(['Year'], axis=1).reset_index(drop=True)
ds2 = dataset[dataset['Year']==max_y].drop(['Year'], axis=1).reset_index(drop=True)
Then perform your comparison:
dsBool = (ds1 != ds2).stack()
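To illustrate why the index matters, here is a minimal sketch with two made-up frames (`a` and `b` are hypothetical, not the question's data). It assumes both frames have the same number of rows; resetting the index does not help when the row counts differ.

```python
import pandas as pd

# Hypothetical frames with the same columns but different index labels.
a = pd.DataFrame({"Score": [1, 2]}, index=[0, 1])
b = pd.DataFrame({"Score": [1, 3]}, index=[5, 6])

# Comparing frames whose labels differ raises ValueError.
try:
    a != b
    raised = False
except ValueError:
    raised = True

# After resetting both indexes, the labels match and the comparison works.
a = a.reset_index(drop=True)
b = b.reset_index(drop=True)
mask = a != b
```

Here `mask` is a boolean frame that is True wherever the two scores differ.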
Edit:
Actually, I think your original post may have the code in the wrong order. Try the following:
dataset = dataset[['Employee Name', 'Skill Name', 'Year', 'Score']]
dataset = dataset.sort_values(['Employee Name', 'Skill Name'], ascending=[True, False])
ds1 = dataset[dataset['Year'] == dataset['Year'].min()].drop(['Year'], axis=1).reset_index(drop=True)
ds2 = dataset[dataset['Year'] == dataset['Year'].max()].drop(['Year'], axis=1).reset_index(drop=True)
dsBool = (ds1 != ds2).stack()
dsdiff = pd.concat([ds1.stack()[dsBool], ds2.stack()[dsBool]], axis=1)
dsdiff.columns=["Old", "New"]
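Note that resetting the index only helps when both years contain exactly the same employee/skill pairs. If pairs are missing in one year, one possible approach is to index both years by ('Employee Name', 'Skill Name') and reindex them onto the union of all pairs, so the gaps become NaN. The following is only a sketch: the sample data is made up, and it assumes each pair appears at most once per year.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's `dataset`.
dataset = pd.DataFrame({
    "Employee Name": ["Ann", "Ann", "Bob", "Ann", "Ann", "Bob", "Cid"],
    "Skill Name": ["SQL", "Python", "SQL", "SQL", "Python", "Excel", "SQL"],
    "Year": [2020, 2020, 2020, 2021, 2021, 2021, 2021],
    "Score": [3, 2, 4, 4, 2, 1, 5],
})

keys = ["Employee Name", "Skill Name"]
min_y, max_y = dataset["Year"].min(), dataset["Year"].max()

# One Score series per year, indexed by (employee, skill).
ds1 = dataset[dataset["Year"] == min_y].set_index(keys)["Score"]
ds2 = dataset[dataset["Year"] == max_y].set_index(keys)["Score"]

# Reindex both onto the union of pairs; missing pairs are filled with NaN.
full_index = ds1.index.union(ds2.index)
ds1 = ds1.reindex(full_index)
ds2 = ds2.reindex(full_index)

# Side-by-side view, then keep only rows where the score changed or is missing.
dsdiff = pd.concat([ds1, ds2], axis=1, keys=["Old", "New"])
changed = dsdiff[dsdiff["Old"].ne(dsdiff["New"])]
```

After the reindex, `ds1` and `ds2` share the same index and shape, so they can be compared directly. A NaN in "Old" means the pair is new in the later year; a NaN in "New" means it no longer appears.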
As I understand it, the shape error occurs because new employees were added and existing employees gained new skills. To find the missing values, you can concatenate the two data-frames and then drop the entries that repeat. The only entries left are those that appear in just one of the two data-frames.
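temp = pd.concat((ds1, ds2), axis=0)
# Assign the result instead of using inplace=True, which returns None.
# The subset uses both key columns because an employee can have several skills.
temp = temp.drop_duplicates(subset=['Employee Name', 'Skill Name'], keep=False)
# keep=False ensures that all repeating entries are considered duplicates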
The temp data-frame now contains the entries that appear in only one of the two original data-frames. You can look them up and add the missing entries to the other data-frame, after which the shapes will match.
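If you also need to know which of the two data-frames each entry is missing from, an outer merge with `indicator=True` is one alternative. This is a sketch on made-up data, assuming `ds1` and `ds2` have the 'Year' column already dropped as in the question.

```python
import pandas as pd

# Hypothetical per-year frames.
ds1 = pd.DataFrame({
    "Employee Name": ["Ann", "Ann", "Bob"],
    "Skill Name": ["SQL", "Python", "SQL"],
    "Score": [3, 2, 4],
})
ds2 = pd.DataFrame({
    "Employee Name": ["Ann", "Ann", "Bob", "Cid"],
    "Skill Name": ["SQL", "Python", "Excel", "SQL"],
    "Score": [4, 2, 1, 5],
})

keys = ["Employee Name", "Skill Name"]

# The outer merge keeps every pair; the _merge column records where it was found.
merged = ds1.merge(ds2, on=keys, how="outer",
                   suffixes=("_old", "_new"), indicator=True)

# Pairs present in only one of the two frames.
missing_in_ds1 = merged.loc[merged["_merge"] == "right_only", keys]
missing_in_ds2 = merged.loc[merged["_merge"] == "left_only", keys]
```

The `merged` frame has one row per employee/skill pair, with `Score_old` and `Score_new` side by side and NaN where a pair is absent from that year.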