While analyzing the SF Salaries dataset from Kaggle ( https://www.kaggle.com/kaggle/sf-salaries ), I would like to rank overtime pay by Year and JobTitle.
My solution was:
df = df[['Year','JobTitle','OvertimePay']].copy()
df2 = df.sort_values('OvertimePay', ascending= False)
The result was not what I expected. Apart from the index being out of order, the rows do not look properly sorted, since 173547.73 should be followed by 163477.81, and so on. Please help. Thank you.
I am not sure you have realized that each row corresponds to a different employee. So when you do df = df[['Year','JobTitle','OvertimePay']].copy(), there are multiple occurrences of "Deputy Sheriff" in the same year, one for each employee. This happens many times, because different employees share the same "JobTitle".
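To see how often a Year/JobTitle pair repeats, you can count the rows per pair. This is a minimal sketch on a made-up toy frame (the values and the name `toy` are illustrative assumptions, not rows from Salaries.csv):

```python
import pandas as pd

# Toy rows standing in for Salaries.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "Year": [2011, 2011, 2011, 2012],
    "JobTitle": ["Deputy Sheriff", "Deputy Sheriff", "Firefighter", "Deputy Sheriff"],
    "OvertimePay": [1000.0, 2500.0, 1800.0, 900.0],
})

# Number of rows (i.e. employees) for each Year/JobTitle pair.
counts = toy.groupby(["Year", "JobTitle"]).size()
print(counts)
```

Any pair with a count above 1 will show up more than once after sorting, which is why the same job title appears on several lines of the sorted output.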
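In order to achieve what you want, you could use drop_duplicates to keep only the highest-paid employee for each "JobTitle" in a "Year". Note that drop_duplicates keeps the first row of each group, so the data must be sorted by OvertimePay in descending order first. However, I advise you to check whether this is really what you are looking for.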
Here is the code I would use:
import numpy as np
import pandas as pd
df = pd.read_csv('Salaries.csv')
df['OvertimePay'] = df['OvertimePay'].replace("Not Provided",np.nan).astype(float)
df = df[['Year','JobTitle','OvertimePay']].copy()
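# Sort first so the highest OvertimePay comes first for each (Year, JobTitle),
# then keep only that first row. drop_duplicates returns a new DataFrame, so assign it.
df2 = df.sort_values('OvertimePay', ascending=False).drop_duplicates(subset=['Year','JobTitle'])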
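If what you want is an explicit ranking rather than just a sorted table, a groupby may express it more directly. The following is only a sketch under assumptions: the frame `sample` and its values are made up, and the `RankInYear` column is a name I chose for illustration.

```python
import pandas as pd

# Made-up sample rows; in practice this would be the cleaned Salaries.csv frame.
sample = pd.DataFrame({
    "Year": [2011, 2011, 2011, 2012, 2012],
    "JobTitle": ["Deputy Sheriff", "Deputy Sheriff", "Firefighter",
                 "Deputy Sheriff", "Firefighter"],
    "OvertimePay": [1000.0, 2500.0, 1800.0, 900.0, 3000.0],
})

# Highest OvertimePay per (Year, JobTitle), sorted from largest to smallest.
top = (
    sample.groupby(["Year", "JobTitle"], as_index=False)["OvertimePay"]
    .max()
    .sort_values("OvertimePay", ascending=False)
    .reset_index(drop=True)
)

# Rank job titles within each year; 1 is the highest overtime pay of that year.
top["RankInYear"] = (
    top.groupby("Year")["OvertimePay"]
    .rank(ascending=False, method="min")
    .astype(int)
)
print(top)
```

Swapping .max() for .sum() or .mean() would rank job titles by total or average overtime instead, depending on what "ranking" should mean for your analysis.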
EDIT: To change the output format, I would use something like:
print(df2.iloc[0:20,].to_string(header=['Year','JobTitle',''],index=False,justify='left',
formatters={'JobTitle':'{{:<{}s}}'.format(df2['JobTitle'].str.len().max()).format}))