
Pandas sort_values

While doing data analysis on the SF Salaries dataset from Kaggle (https://www.kaggle.com/kaggle/sf-salaries), I wanted to rank overtime pay by Year and JobTitle.

What I wanted to get:

My solution was:

df = df[['Year', 'JobTitle', 'OvertimePay']].copy()
df2 = df.sort_values('OvertimePay', ascending=False)

which turned out like this. Obviously, this is not what I expected. Apart from the index, the result looks wrongly sorted, since 173547.73 should be followed by 163477.81, and so on. Please help. Thank you.

I am not sure you have realized that each row corresponds to a different employee. So when you do df = df[['Year','JobTitle','OvertimePay']].copy(), "Deputy Sheriff" appears many times in the same year, once for each employee. The same happens for every "JobTitle" held by more than one employee.
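To see the effect on a small scale, here is a minimal sketch with made-up rows (the `toy` frame and its values are illustrative assumptions, not taken from the Kaggle file). It counts how many rows share each Year/JobTitle pair, then shows that sorting by OvertimePay alone keeps every employee row, so titles repeat.

```python
import pandas as pd

# Hypothetical rows standing in for the dataset: one row per employee
toy = pd.DataFrame({
    'Year': [2011, 2011, 2011, 2012, 2012],
    'JobTitle': ['Deputy Sheriff', 'Deputy Sheriff', 'Firefighter',
                 'Deputy Sheriff', 'Firefighter'],
    'OvertimePay': [100.0, 250.0, 80.0, 300.0, 120.0],
})

# Number of rows (employees) for each Year / JobTitle combination
rows_per_group = toy.groupby(['Year', 'JobTitle']).size()
print(rows_per_group)

# Sorting by pay alone keeps every employee row, so job titles repeat
sorted_toy = toy.sort_values('OvertimePay', ascending=False)
print(sorted_toy)
```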

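To achieve what you want, you could sort by OvertimePay and then use drop_duplicates to keep only the highest-paid employee for each "JobTitle" in a "Year". drop_duplicates keeps the first row of each group, so the sort has to come first. However, I advise you to check whether this is really what you are looking for.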

Here is the code I would use:

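import numpy as np
import pandas as pd

df = pd.read_csv('Salaries.csv')
# Some rows hold the string "Not Provided"; turn them into NaN so the column can be cast to float
df['OvertimePay'] = df['OvertimePay'].replace("Not Provided", np.nan).astype(float)
df = df[['Year', 'JobTitle', 'OvertimePay']].copy()
# Sort first so the highest OvertimePay comes first, then keep only the
# first row for each Year / JobTitle pair (drop_duplicates returns a new frame)
df2 = df.sort_values('OvertimePay', ascending=False)
df2 = df2.drop_duplicates(subset=['Year', 'JobTitle'])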
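As a quick check of the idea, here is a sketch on made-up data (the `toy` frame is hypothetical). Sorting before dropping duplicates keeps the largest OvertimePay per Year/JobTitle pair, and a groupby with max, offered here as one possible alternative, yields the same values while also allowing the result to be ordered within each year.

```python
import pandas as pd

# Hypothetical rows: one row per employee
toy = pd.DataFrame({
    'Year': [2011, 2011, 2011, 2012, 2012],
    'JobTitle': ['Deputy Sheriff', 'Deputy Sheriff', 'Firefighter',
                 'Deputy Sheriff', 'Firefighter'],
    'OvertimePay': [100.0, 250.0, 80.0, 300.0, 120.0],
})

# Highest pay first, then keep the first row of each Year / JobTitle pair
top = (toy.sort_values('OvertimePay', ascending=False)
          .drop_duplicates(subset=['Year', 'JobTitle']))

# Alternative: aggregate directly, then order by year and by pay within a year
by_group = (toy.groupby(['Year', 'JobTitle'], as_index=False)['OvertimePay']
               .max()
               .sort_values(['Year', 'OvertimePay'], ascending=[True, False]))

print(top)
print(by_group)
```

Which of the two fits better depends on whether the ranking should be global or per year.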

EDIT: To change the output format, I would use something like:

print(df2.iloc[0:20,].to_string(header=['Year','JobTitle',''],index=False,justify='left',
                                formatters={'JobTitle':'{{:<{}s}}'.format(df2['JobTitle'].str.len().max()).format}))
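The nested braces in that formatter can be hard to read, so here is a small sketch of what they do, using a made-up list of titles instead of the real column (the names `titles`, `width`, `template` and `pad` are illustrative).

```python
# Made-up titles; in the real data the width comes from df2['JobTitle'].str.len().max()
titles = ['Deputy Sheriff', 'Firefighter']
width = max(len(t) for t in titles)

# '{{' and '}}' are literal braces, so the outer format() fills in only the width
template = '{{:<{}s}}'.format(width)
print(template)

# The bound .format method is the callable handed to to_string via formatters;
# it left-justifies each value and pads it to the common width
pad = template.format
padded = [pad(t) for t in titles]
print(padded)
```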
