
How to find mean with group by or pivot table in pandas dataframe?

I am using the salaries.csv dataset, which you can find at https://www.kaggle.com/kaggle/sf-salaries/data. I am trying to find the job titles that have more than 500 data points, and then calculate the mean TotalPayBenefits for each of those job titles. The output should print the top-10 earning job titles.

What I did:

import pandas as pd

salaries = pd.read_csv('Salaries.csv')
salaries = salaries.drop(["Id", "Notes", "Status", "Agency"], axis = 1)
salaries = salaries.dropna()
salaries.head()

jobtitlelist = (salaries.JobTitle.value_counts()>500)[0:10]
data_10jobtitle = salaries[salaries.JobTitle.isin(jobtitlelist.index)]
avgsalary_10jobtitle = data_10jobtitle.groupby(by=data_10jobtitle.JobTitle).TotalPayBenefits.mean()
print(avgsalary_10jobtitle)

My output is: [image]

I think I am missing something small, because I do not get the expected output.

You need to change the jobtitlelist line to:

jobtitlelist = salaries.JobTitle.value_counts()[(salaries.JobTitle.value_counts()>500)][0:10]

In this line:

jobtitlelist = (salaries.JobTitle.value_counts()>500)[0:10]

You compare the record count of each job title with 500, then take the first 10 job titles, and compute the average total pay benefits for those. So your workflow is

  1. check which job titles have more than 500 records
  2. take the first 10 job titles
  3. compute average total pay
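
One detail of this workflow may be worth checking: the comparison returns a boolean Series, and slicing it keeps the first rows whether they are True or False, so using its index does not actually apply the threshold. The sketch below illustrates this on made-up toy data, with a threshold of 2 standing in for 500 and a slice of 3 standing in for 10; the names `titles`, `mask` and `sliced` are only for illustration.

```python
import pandas as pd

# Toy data (hypothetical); threshold 2 stands in for 500.
titles = pd.Series(["Nurse"] * 4 + ["Clerk"] * 3 + ["Pilot"] * 1, name="JobTitle")

counts = titles.value_counts()   # sorted by count, descending
mask = counts > 2                # boolean Series indexed by job title

# Slicing the mask keeps the first rows by position, True or False alike.
sliced = mask[0:3]
print(sliced)

# The index of the sliced mask still lists every title in the slice,
# including "Pilot", which fails the threshold.
selected_by_index = list(sliced.index)

# Indexing the counts with the mask keeps only titles that pass.
selected_by_mask = list(counts[mask].index)
print(selected_by_index, selected_by_mask)
```

In the real dataset the 10 most frequent titles may all exceed 500 records, in which case both selections agree; the difference only shows up when the slice reaches past the threshold.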

But based on your question, your workflow should be

  1. keep only job titles that have more than 500 records
  2. compute average total pay of jobs from step 1)
  3. sort average total pay in descending order
  4. the top 10 rows of the sorted result will be what you are looking for
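
The intended workflow can be sketched roughly as follows. This is a minimal example on a made-up toy frame rather than the real Salaries.csv, so the values, the `min_count` threshold of 2 (use 500 for the real data) and the variable names are all assumptions. It sorts in descending order so that the highest-paying titles come first.

```python
import pandas as pd

# Toy stand-in for Salaries.csv (values are made up).
salaries = pd.DataFrame({
    "JobTitle": ["Nurse"] * 3 + ["Clerk"] * 3 + ["Pilot"] * 1,
    "TotalPayBenefits": [90.0, 100.0, 110.0, 40.0, 50.0, 60.0, 500.0],
})
min_count = 2   # assumption for the toy data; 500 in the question
top_n = 10

# 1. keep only job titles with more than min_count records
counts = salaries["JobTitle"].value_counts()
frequent_titles = counts[counts > min_count].index
frequent = salaries[salaries["JobTitle"].isin(frequent_titles)]

# 2. mean TotalPayBenefits per remaining job title
avg_pay = frequent.groupby("JobTitle")["TotalPayBenefits"].mean()

# 3-4. sort from highest to lowest and take the first top_n rows
top_earning = avg_pay.sort_values(ascending=False).head(top_n)
print(top_earning)
```

Here "Pilot" has the highest pay but only one record, so it is dropped in step 1 before any averaging happens. Since the question title also mentions pivot tables, step 2 could presumably be written with `pd.pivot_table(frequent, index="JobTitle", values="TotalPayBenefits", aggfunc="mean")` instead, which returns a one-column DataFrame rather than a Series.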
