Sorting Pandas MultiIndex by the last value of level 0 index

I have a DataFrame called df_world with the following shape:

                               Cases   Death  Delta_Cases  Delta_Death
Country/Region Date                                                       
Brazil         2020-01-22        0.0       0          NaN          NaN
               2020-01-23        0.0       0          0.0          0.0
               2020-01-24        0.0       0          0.0          0.0
               2020-01-25        0.0       0          0.0          0.0
               2020-01-26        0.0       0          0.0          0.0
                             ...     ...          ...          ...
World          2020-05-12  4261747.0  291942      84245.0       5612.0
               2020-05-13  4347018.0  297197      85271.0       5255.0
               2020-05-14  4442163.0  302418      95145.0       5221.0
               2020-05-15  4542347.0  307666     100184.0       5248.0
               2020-05-16  4634068.0  311781      91721.0       4115.0

I'd like to sort the country index by the value of the 'Cases' column on the last recorded date, i.e. compare the 'Cases' values on 2020-05-16 across all countries and return the sorted list of countries.

I thought about creating another DataFrame with only the 2020-05-16 values and then using df.sort_values() on it, but I am sure there has to be a more efficient way.

While I'm at it, I've also tried to select only the countries whose number of cases on 2020-05-16 is above a certain value, and the only way I found to do it was to iterate over the country index:

for a_country in df_world.index.levels[0]:
    if df_world.loc[(a_country, last_date), 'Cases'] < cut_off_val:
        df_world = df_world.drop(index=a_country)

But it's quite a poor way to do it.

If anyone has an idea on how to improve the efficiency of this code, I'd be very happy.

Thank you :)

You can first group the dataset by "Country/Region", then sort each group by "Date", take the last row, and sort again by "Cases".

Faking some data myself (the data types are different, but you see my point):

import pandas as pd

df = pd.DataFrame([['a', 1, 100],
                   ['a', 2, 10],
                   ['b', 2, 55],
                   ['b', 3, 15],
                   ['c', 1, 22],
                   ['c', 3, 80]])
df.columns = ['country', 'date', 'cases']
df = df.set_index(['country', 'date'])
print(df)
#               cases
# country date       
# a       1       100
#         2        10
# b       2        55
#         3        15
# c       1        22
#         3        80

Then,

# group them by country (a level of the index)
grp_by_country = df.groupby(level='country')
# for each group, sort by date and take the last row (latest date)
latest_per_grp = grp_by_country.apply(lambda g: g.sort_index(level='date').iloc[-1])
# sort again by cases
sorted_by_cases = latest_per_grp.sort_values(by='cases')

print(sorted_by_cases)
#          cases
# country       
# a           10
# b           15
# c           80
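If the goal is to reorder the full frame rather than only obtain the sorted country list, one possible approach is to reindex level 0 with that list. This is only a sketch on the same toy data; `latest`, `country_order` and `df_sorted` are names introduced here for illustration, and the descending order is an assumption about what is wanted.

```python
import pandas as pd

# Same toy data as above
df = pd.DataFrame(
    [['a', 1, 100], ['a', 2, 10], ['b', 2, 55],
     ['b', 3, 15], ['c', 1, 22], ['c', 3, 80]],
    columns=['country', 'date', 'cases'],
).set_index(['country', 'date'])

# Latest row per country: sort by date first, then take the last row of each group
latest = df.sort_index(level='date').groupby(level='country').last()

# Countries ordered by their latest 'cases' value, largest first (assumed order)
country_order = latest.sort_values(by='cases', ascending=False).index

# Reorder the level-0 blocks of the full frame to follow that country order
df_sorted = df.reindex(index=country_order, level='country')
print(df_sorted)
```

The rows of each country stay together and keep their original date order; only the order of the country blocks changes.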

Stay safe!

last_recs = df_world.reset_index().groupby('Country/Region').last()
# after the groupby, 'Country/Region' is the index of last_recs, not a column
sorted_countries = last_recs.sort_values('Cases').index

As I don't have your raw data, I can't test it, but this should do what you need. I believe all the methods are self-explanatory.

You may need to sort df_world by date before the first line if it isn't sorted already.
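For the cut-off part of the question, the loop can probably be replaced by a boolean mask built from the values on the last date. This is a minimal sketch, not tested on the real data: the small df_world below is a made-up stand-in with invented numbers, cut_off_val = 100 is an arbitrary assumption, and `last_cases`, `keep` and `filtered` are names introduced here.

```python
import pandas as pd

# Hypothetical stand-in for df_world (invented values, same index layout)
dates = pd.to_datetime(['2020-05-15', '2020-05-16'])
idx = pd.MultiIndex.from_product(
    [['Brazil', 'France', 'World'], dates],
    names=['Country/Region', 'Date'],
)
df_world = pd.DataFrame(
    {'Cases': [100.0, 120.0, 50.0, 40.0, 900.0, 1000.0]}, index=idx
)

last_date = df_world.index.get_level_values('Date').max()
cut_off_val = 100  # assumed threshold, for illustration only

# 'Cases' on the last date, as a Series indexed by country
last_cases = df_world.xs(last_date, level='Date')['Cases']

# Countries at or above the cut-off (the loop drops those strictly below it)
keep = last_cases[last_cases >= cut_off_val].index

# Keep every row whose country is in that set
filtered = df_world[df_world.index.get_level_values('Country/Region').isin(keep)]
print(filtered)
```

This takes one cross-section and builds one mask instead of calling drop once per country.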
