Sort pandas dataframe by sum of columns
I have a dataframe that looks like this:
            Australia  Austria  United Kingdom  Vietnam
date
2020-01-30          9        0               1        2
2020-01-31          9        9               4        2
I would like to create a new dataframe that includes only the countries whose column sum is > 4, and I do it like this:
df1 = df[[i for i in df.columns if int(df[i].sum()) > 4]]
This gives me:
            Australia  Austria  United Kingdom
date
2020-01-30          9        0               1
2020-01-31          9        9               4
I would now like to sort the countries by the sum of their column and then take the first 2:
            Australia  Austria
date
2020-01-30          9        0
2020-01-31          9        9
I know I have to use sort_values and tail, but I just can't work out how.
IIUC, you can do:
s = df.sum()
df[s.sort_values(ascending=False).index[:2]]
Output:
            Australia  Austria
date
2020-01-30          9        0
2020-01-31          9        9
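For anyone who wants to run this end to end, here is a minimal sketch. It rebuilds the sample frame from the question by hand (the construction and the integer dtype are assumptions, since the original data source is not shown) and combines the > 4 filter from the question with the sort above:

```python
import pandas as pd

# Sample data reconstructed from the question (assumed to be integers).
df = pd.DataFrame(
    {
        "Australia": [9, 9],
        "Austria": [0, 9],
        "United Kingdom": [1, 4],
        "Vietnam": [2, 2],
    },
    index=pd.Index(["2020-01-30", "2020-01-31"], name="date"),
)

s = df.sum()                                      # one total per column
s = s[s > 4]                                      # keep totals above the threshold
top2 = s.sort_values(ascending=False).index[:2]   # labels of the two largest totals
result = df[top2]
print(result)
```

Because the sort is descending, the first two labels are the ones wanted, so head-style slicing is enough and tail is not needed.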
First filter for sums greater than 4, then use Series.nlargest to get the top 2 sums, and select the columns by the resulting index values:
s = df.sum()
df = df[s[s > 4].nlargest(2).index]
print (df)
            Australia  Austria
date
2020-01-30          9        0
2020-01-31          9        9
Details:
print (s)
Australia         18
Austria            9
United Kingdom     5
Vietnam            4
dtype: int64
print (s[s > 4])
Australia         18
Austria            9
United Kingdom     5
dtype: int64
print (s[s > 4].nlargest(2))
Australia    18
Austria       9
dtype: int64
print (s[s > 4].nlargest(2).index)
Index(['Australia', 'Austria'], dtype='object')
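If this is needed in more than one place, the same steps could be wrapped in a small function. The sketch below is only an illustration: the name top_columns_by_sum and its parameters n and min_total are hypothetical, not part of pandas, and the toy frame is made up for the example.

```python
import pandas as pd

def top_columns_by_sum(df, n=2, min_total=4):
    """Hypothetical helper: keep the n columns with the largest totals above min_total."""
    totals = df.sum()                                   # one total per column
    labels = totals[totals > min_total].nlargest(n).index
    return df[labels]

# Toy data, invented for this example.
df = pd.DataFrame({"a": [1, 2], "b": [5, 5], "c": [3, 4], "d": [0, 1]})
out = top_columns_by_sum(df, n=2, min_total=4)
print(list(out.columns))
```

Series.nlargest also accepts a keep argument ('first', 'last' or 'all') that controls what happens when totals are tied, which may matter if several countries share the same sum.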
You can take the sum of the dataframe along the first axis, sort the result with sort_values, and take the first n columns:
df[df.sum(0).sort_values(ascending=False)[:2].index]
            Australia  Austria
2020-01-30          9        0
2020-01-31          9        9
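As a quick reminder of what "along the first axis" means here, the following sketch (with a toy frame invented for the example) contrasts the two axes. df.sum(0) is the same as df.sum() or df.sum(axis=0):

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({"x": [1, 2], "y": [10, 20]}, index=["r1", "r2"])

col_totals = df.sum(0)   # axis=0: collapse the rows, one value per column
row_totals = df.sum(1)   # axis=1: collapse the columns, one value per row

print(col_totals.to_dict())
print(row_totals.to_dict())
```

Sorting columns by their totals needs the per-column result, which is why the answer uses axis 0.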
Another way, modifying your list comprehension slightly:
# groupby orders its result by column name, not by sum, so nlargest(2) picks the two largest totals
cols = df[[i for i in df.columns if int(df[i].sum()) > 4]].stack().groupby(level=1).sum().nlargest(2).index
# df.stack().groupby(level=1).sum().nlargest(2).index would yield the same result
df[cols]
            Australia  Austria
date
2020-01-30          9        0
2020-01-31          9        9
You can also do this inline using the .pipe function, which helps if you don't want to define a variable for a temporary result:
df.pipe(lambda df: df.loc[:, df.sum().sort_values(ascending=False).index])
For example, you might have a pipeline:
new_df = (
    df1
    # Some example operations one might do:
    .groupby('column')
    .apply(sum).unstack()
    .fillna(0).astype(int)
    # Sort columns by total count:
    .pipe(lambda df: df.loc[:, df.sum().sort_values(ascending=False).index])
)
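The pipeline above only sorts the columns. A runnable sketch that also applies the > 4 filter and keeps the top 2, using a hand-built copy of the question's data (the construction is an assumption), might look like this:

```python
import pandas as pd

# Sample data reconstructed from the question (assumed to be integers).
df = pd.DataFrame(
    {
        "Australia": [9, 9],
        "Austria": [0, 9],
        "United Kingdom": [1, 4],
        "Vietnam": [2, 2],
    },
    index=pd.Index(["2020-01-30", "2020-01-31"], name="date"),
)

top2 = (
    df
    # Keep only columns whose total exceeds 4 (boolean mask over the columns).
    .pipe(lambda d: d.loc[:, d.sum() > 4])
    # Order the remaining columns by total, largest first, and keep two.
    .pipe(lambda d: d.loc[:, d.sum().sort_values(ascending=False).index[:2]])
)
print(top2)
```

Each lambda receives the intermediate frame, so no temporary variable is needed between the steps.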