country state year area
usa iowa 2000 30
usa iowa 2001 30
usa iowa 2002 30
usa iowa 2003 30
usa kansas 2000 500
usa kansas 2001 500
usa kansas 2002 500
usa kansas 2003 500
usa washington 2000 245
usa washington 2001 245
usa washington 2002 245
usa washington 2003 245
In the dataframe above, I want to drop the rows for any state whose share of the total area is below 10%. In this case that would be all rows with state as iowa. What is the best way to do this in pandas? I tried groupby but am not sure how to proceed.
df.groupby('area').sum()
Another solution with drop_duplicates and double boolean indexing:
a = df.drop_duplicates(['state','area'])
print (a)
country state year area
0 usa iowa 2000 30
4 usa kansas 2000 500
8 usa washington 2000 245
states = a.loc[a.area.div(a.area.sum()) >.1, 'state']
print (states)
4 kansas
8 washington
Name: state, dtype: object
print (df[df.state.isin(states)])
country state year area
4 usa kansas 2000 500
5 usa kansas 2001 500
6 usa kansas 2002 500
7 usa kansas 2003 500
8 usa washington 2000 245
9 usa washington 2001 245
10 usa washington 2002 245
11 usa washington 2003 245
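For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. It rebuilds the question's dataframe by hand, so the construction code and the names states_area, unique_states, share, keep and result are my own assumptions rather than part of the original answer.

```python
import pandas as pd

# Rebuild the sample data from the question: one row per state and year,
# with the same area repeated for every year of a state.
states_area = {"iowa": 30, "kansas": 500, "washington": 245}
df = pd.DataFrame(
    [("usa", s, y, a) for s, a in states_area.items() for y in range(2000, 2004)],
    columns=["country", "state", "year", "area"],
)

# Keep one row per (state, area) pair so each area is counted only once.
unique_states = df.drop_duplicates(["state", "area"])

# Each state's share of the total area.
share = unique_states["area"] / unique_states["area"].sum()

# First boolean index: the states whose share is above 10%.
keep = unique_states.loc[share > 0.1, "state"]

# Second boolean index: the original rows that belong to those states.
result = df[df["state"].isin(keep)]
print(result)
```

Deduplicating first matters because summing the area column of the full dataframe would count every state once per year.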
You want to take any one of the area values within each state and sum them up. I take the first.
df.groupby('state').area.first().sum() is the thing we normalize by.
df[df.area.div(df.groupby('state').area.first().sum()) >= .1]
country state year area
4 usa kansas 2000 500
5 usa kansas 2001 500
6 usa kansas 2002 500
7 usa kansas 2003 500
8 usa washington 2000 245
9 usa washington 2001 245
10 usa washington 2002 245
11 usa washington 2003 245
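The same idea can be spelled out step by step. This is only a sketch under the assumption that every row of a state carries the same area, as in the sample data. The dataframe construction and the names area_per_state, total_area, share, filtered and filtered_by_map are mine, and the map variant is an alternative I am adding for comparison, not something from the answer above.

```python
import pandas as pd

# Rebuild the sample data from the question.
states_area = {"iowa": 30, "kansas": 500, "washington": 245}
df = pd.DataFrame(
    [("usa", s, y, a) for s, a in states_area.items() for y in range(2000, 2004)],
    columns=["country", "state", "year", "area"],
)

# One area value per state (the first one), then the total we normalize by.
area_per_state = df.groupby("state")["area"].first()
total_area = area_per_state.sum()

# Row-wise filter: each row's area divided by the total area.
filtered = df[df["area"].div(total_area) >= 0.1]

# Alternative: compute the share once per state and map it back onto the rows.
share = area_per_state / total_area
filtered_by_map = df[df["state"].map(share) >= 0.1]

print(filtered)
```

On this data both filters keep the same eight rows. The map variant may be easier to adapt if the per-state value ever needs to be something other than the first row's area.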