Pandas sort_values

Question

While analyzing the SF Salaries dataset from Kaggle ( https://www.kaggle.com/kaggle/sf-salaries ), I would like to rank overtime pay by Year and JobTitle.

What I wanted to get

My solution was:

df = df[['Year','JobTitle','OvertimePay']].copy()
df2 = df.sort_values('OvertimePay', ascending= False)

The result did not turn out as I expected. Index aside, the rows seem to be sorted incorrectly, since 173547.73 should be followed by 163477.81, and so on. Please help. Thank you.

Answer 1

I am not sure you've realized that each row corresponds to a different employee. So after df = df[['Year','JobTitle','OvertimePay']].copy() , there are multiple occurrences of "Deputy Sheriff" in the same year, one for each employee holding that "JobTitle".
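To check this on your own data, you can count the rows per (Year, JobTitle) pair. The sketch below uses a small made-up frame (the name toy and its values are only illustrative, not taken from Salaries.csv):

```python
import pandas as pd

# Hypothetical toy rows standing in for the real Salaries.csv data
toy = pd.DataFrame({
    "Year": [2011, 2011, 2011, 2012],
    "JobTitle": ["Deputy Sheriff", "Deputy Sheriff", "Firefighter", "Deputy Sheriff"],
    "OvertimePay": [173547.73, 163477.81, 97868.77, 120000.00],
})

# Number of rows (employees) sharing the same Year and JobTitle
rows_per_group = toy.groupby(["Year", "JobTitle"]).size()
print(rows_per_group)
```

Any count above 1 means that several employees share that job title in that year, so the same (Year, JobTitle) pair will appear several times in the sorted output.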

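To achieve what you want, you could sort by "OvertimePay" and then use drop_duplicates to keep only the highest-paid employee for each "JobTitle" in a "Year". Note that drop_duplicates keeps the first occurrence by default, so the sort has to come first. However, I advise you to consider whether this is really what you are looking for.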

Here is the code I would use:

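import numpy as np
import pandas as pd

df = pd.read_csv('Salaries.csv')
# Replace the "Not Provided" strings with NaN so the column can be cast to float
df['OvertimePay'] = df['OvertimePay'].replace("Not Provided", np.nan).astype(float)
df = df[['Year','JobTitle','OvertimePay']].copy()
# Sort first: drop_duplicates keeps the first row of each (Year, JobTitle) pair,
# which after sorting is the one with the highest OvertimePay
df2 = df.sort_values('OvertimePay', ascending=False)
df2 = df2.drop_duplicates(subset=['Year','JobTitle'])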
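As an alternative, a groupby with max should give the same one-row-per-group result without relying on row order. This is only a sketch on made-up data (toy and top are illustrative names, and the values are not from the real file):

```python
import pandas as pd

# Hypothetical toy rows standing in for the real Salaries.csv data
toy = pd.DataFrame({
    "Year": [2011, 2011, 2011, 2012],
    "JobTitle": ["Deputy Sheriff", "Deputy Sheriff", "Firefighter", "Deputy Sheriff"],
    "OvertimePay": [173547.73, 163477.81, 97868.77, 120000.00],
})

# Highest OvertimePay per (Year, JobTitle), then ranked from high to low
top = (
    toy.groupby(["Year", "JobTitle"], as_index=False)["OvertimePay"]
       .max()
       .sort_values("OvertimePay", ascending=False)
)
print(top)
```

If you would rather rank by the total or the average overtime for a job title, max can be swapped for sum or mean.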

EDIT: To change the printed format, I would use something like:

print(df2.iloc[0:20,].to_string(header=['Year','JobTitle',''],index=False,justify='left',
                                formatters={'JobTitle':'{{:<{}s}}'.format(df2['JobTitle'].str.len().max()).format}))
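In case the nested format call looks cryptic: the doubled braces survive the first format as literal braces, so the first call only fills in the width, and the bound .format of the resulting template is what gets used as the formatter. A small standalone sketch (width, template and pad are illustrative names, and 14 is an arbitrary width rather than the real maximum):

```python
# Hypothetical width; in the snippet above it comes from df2['JobTitle'].str.len().max()
width = 14

# First format: '{{' and '}}' become literal braces, '{}' receives the width
template = "{{:<{}s}}".format(width)
print(template)

# The bound method template.format left-justifies a string to that width
pad = template.format
padded = pad("Firefighter")
print(repr(padded))
```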
