I have a dataframe that looks like this
Australia Austria United Kingdom Vietnam
date
2020-01-30 9 0 1 2
2020-01-31 9 9 4 2
I would like to create a new dataframe that includes only the countries whose column sum is greater than 4, and I do it like this:
df1 = df[[i for i in df.columns if int(df[i].sum()) > 4]]
this gives me
Australia Austria United Kingdom
date
2020-01-30 9 0 1
2020-01-31 9 9 4
I would now like to sort the countries by the sum of their column and then take the first 2:
Australia Austria
date
2020-01-30 9 0
2020-01-31 9 9
I know I have to use sort_values and tail. I just can't work out how.
IIUC, you can do:
s = df.sum()
df[s.sort_values(ascending=False).index[:2]]
Output:
Australia Austria
date
2020-01-30 9 0
2020-01-31 9 9
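Note that the snippet above ranks every column and does not apply the "sum > 4" condition from the question. A minimal sketch that combines both steps follows; the helper name top_columns_by_sum and its parameters are made up for illustration, and the frame is rebuilt from the question's sample data:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "Australia": [9, 9],
        "Austria": [0, 9],
        "United Kingdom": [1, 4],
        "Vietnam": [2, 2],
    },
    index=pd.Index(["2020-01-30", "2020-01-31"], name="date"),
)


def top_columns_by_sum(frame, threshold, n):
    """Hypothetical helper: keep columns whose sum exceeds `threshold`,
    then return the `n` columns with the largest sums."""
    totals = frame.sum()
    keep = totals[totals > threshold].sort_values(ascending=False).index[:n]
    return frame[keep]


result = top_columns_by_sum(df, threshold=4, n=2)
print(result)
```

With ascending=False the largest sums come first, so slicing the index with [:n] plays the role the question expected tail to play on an ascending sort.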
First filter for sums greater than 4, then add Series.nlargest for the top 2 sums and select the columns by the resulting index values:
s = df.sum()
df = df[s[s > 4].nlargest(2).index]
print (df)
Australia Austria
date
2020-01-30 9 0
2020-01-31 9 9
Details:
print (s)
Australia 18
Austria 9
United Kingdom 5
Vietnam 4
dtype: int64
print (s[s > 4])
Australia 18
Austria 9
United Kingdom 5
dtype: int64
print (s[s > 4].nlargest(2))
Australia 18
Austria 9
dtype: int64
print (s[s > 4].nlargest(2).index)
Index(['Australia', 'Austria'], dtype='object')
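One thing worth knowing about nlargest is how it treats ties. The sketch below uses made-up totals (not the question's data) in which two labels share second place. By default, keep="first" returns exactly n entries, while keep="all" may return more than n so that no tied label is dropped:

```python
import pandas as pd

# Hypothetical totals with a tie for second place (not from the question).
totals = pd.Series({"A": 18, "B": 9, "C": 9, "D": 3})

# Default keep="first": exactly two entries, the earlier tied label wins.
first_two = totals.nlargest(2)

# keep="all": every label tied with the last selected value is retained.
with_ties = totals.nlargest(2, keep="all")

print(list(first_two.index))
print(list(with_ties.index))
```

Which behaviour is right depends on whether "top 2" should mean exactly two columns or every column that ties for a top-2 sum.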
You can take the sum of the dataframe along the first axis, sort_values, and take the first n columns:
df[df.sum(0).sort_values(ascending=False)[:2].index]
Australia Austria
2020-01-30 9 0
2020-01-31 9 9
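Another way, modifying your list comprehension slightly. Note that groupby returns the labels in alphabetical order, so use nlargest(2) rather than head(2) to pick the columns with the largest sums:
cols = df[[i for i in df.columns if int(df[i].sum()) > 4]].stack().groupby(level=1).sum().nlargest(2).index
# For this data, df.stack().groupby(level=1).sum().nlargest(2).index would yield the same result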
df[cols]
Australia Austria
date
2020-01-30 9 0
2020-01-31 9 9
You can also do this inline using the .pipe function, which helps if you don't want to define a variable for a temporary result:
df.pipe(lambda df: df.loc[:, df.sum().sort_values(ascending=False).index])
For example, you might have a pipeline:
new_df = (
    df1
    # Some example operations one might do:
    .groupby('column')
    .apply(sum).unstack()
    .fillna(0).astype(int)
    # Sort columns by total count:
    .pipe(lambda df: df.loc[:, df.sum().sort_values(ascending=False).index])
)
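The .pipe snippets above only reorder the columns. As a rough sketch, the same inline style can also apply the "sum > 4" filter and keep the top 2, here on a frame rebuilt from the question's sample data. The lambda argument names d and s are arbitrary:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "Australia": [9, 9],
        "Austria": [0, 9],
        "United Kingdom": [1, 4],
        "Vietnam": [2, 2],
    },
    index=pd.Index(["2020-01-30", "2020-01-31"], name="date"),
)

top2 = df.pipe(
    # Sum each column, keep sums above 4, then select the 2 largest by label.
    lambda d: d.loc[:, d.sum().loc[lambda s: s > 4].nlargest(2).index]
)
print(top2)
```

Passing a callable to .loc on the Series avoids naming the intermediate sums, which is the same motivation as using .pipe on the frame.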