
Trying to find number of occurrences within a specific range

I have imported a CSV file of graduate data with columns such as grad_year, grad_major, grad_gender, and gpa.

The objective is to take the top 100 GPAs and determine how many of the graduates with those GPAs are female.

I've sorted the data to get the top 100 GPAs, but I'm stuck on how to filter for just the females from that point.

import pandas as pd 

grads_df = pd.read_csv('Users/Sas0908/Downloads/grads.csv')

sort_gpa = grads_df.sort_values(by=['gpa']).tail(100)

This is where I'm stuck: I'm unsure how to filter sort_gpa down to only the rows where grad_gender == 'Female'.

Use the loc function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

sort_gpa.loc[sort_gpa['grad_gender']=='Female']
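The filter above returns the matching rows, not a number, so one more step is needed to get the count. Here is a minimal sketch of that step. The small DataFrame is made-up sample data standing in for the real sort_gpa, and the variable names females, female_count, and female_count_alt are only illustrative.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's sort_gpa frame
sort_gpa = pd.DataFrame({
    "grad_gender": ["Female", "Male", "Female", "Male", "Female"],
    "gpa": [3.9, 3.8, 3.7, 3.6, 3.5],
})

# Keep only the rows where grad_gender is 'Female'
females = sort_gpa.loc[sort_gpa["grad_gender"] == "Female"]

# The number of matching rows is the length of the filtered frame
female_count = len(females)

# Equivalent shortcut: True counts as 1 when summing a boolean Series
female_count_alt = int((sort_gpa["grad_gender"] == "Female").sum())

print(female_count, female_count_alt)
```

Either form should work; the boolean sum avoids building the intermediate filtered frame when only the count matters.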

To get the top 100 sorted by GPA, you have it right, except that you can also pass the additional argument ascending to change the sort order:

# sort with highest GPAs appearing at the top
sort_gpa.sort_values(by='gpa', ascending=False)

To get the first 100 rows of the DataFrame, you can use head (or tail, as you did, for the last 100 rows). Another common way is to use .iloc, which lets you grab rows by position:

# gets the first 100 rows, positions 0 thru 99
sort_gpa.iloc[:100]
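To make the relationship between head, tail, and .iloc concrete, the sketch below compares the three on a small made-up frame, using n = 3 in place of 100. The data and the variable names are assumptions for illustration. The GPAs are all distinct here; with tied values at the cutoff, the approaches may pick different rows.

```python
import pandas as pd

# Hypothetical sample data; the real frame comes from grads.csv
grads_df = pd.DataFrame({
    "grad_gender": ["Male", "Female", "Male", "Female", "Female"],
    "gpa": [3.1, 3.9, 2.8, 3.7, 3.5],
})
n = 3  # stands in for 100

# Ascending sort, then the last n rows (the approach in the question)
via_tail = grads_df.sort_values(by="gpa").tail(n)

# Descending sort, then the first n rows by position
via_iloc = grads_df.sort_values(by="gpa", ascending=False).iloc[:n]

# head(n) is the same as .iloc[:n]
via_head = grads_df.sort_values(by="gpa", ascending=False).head(n)

# Same rows in all three cases; only the display order differs for tail
same_rows = set(via_tail.index) == set(via_iloc.index) == set(via_head.index)
print(same_rows)
```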

Finally, if you want to know the number of females vs. males, you can use .value_counts() on a column:

# returns the counts of all values that appear in the column
sort_gpa['grad_gender'].value_counts()

Putting that all together, you have:

top_100 = sort_gpa.sort_values(by='gpa', ascending=False).iloc[:100]
top_100['grad_gender'].value_counts()
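As a rough end-to-end sketch, the same steps can be wrapped in a small helper and run on sample data. The function name count_top_gpa_by_gender, its n parameter, and the DataFrame contents are hypothetical; only the column names grad_gender and gpa come from the question.

```python
import pandas as pd

def count_top_gpa_by_gender(df, n=100):
    """Return gender counts among the n rows with the highest gpa."""
    # Highest GPAs first, then keep the first n rows by position
    top_n = df.sort_values(by="gpa", ascending=False).iloc[:n]
    # Count how often each gender appears among those rows
    return top_n["grad_gender"].value_counts()

# Hypothetical sample data; the real frame would come from pd.read_csv
grads_df = pd.DataFrame({
    "grad_gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "gpa": [3.9, 3.8, 3.7, 3.2, 3.6, 3.0],
})

counts = count_top_gpa_by_gender(grads_df, n=4)

# .get avoids a KeyError if no females are in the top n
female_count = int(counts.get("Female", 0))
print(female_count)
```

With the real data, calling count_top_gpa_by_gender(grads_df) with the default n=100 would give the counts for the top 100 GPAs.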
