Manipulate ordering/sorting of Multirow columns in a pandas DataFrame

This is a side problem caused by an answer to another question.

I combine two crosstab() results, one with counts and one with normalized values. The problem is that the resulting column names are not in the right order. "Right" means that the margins_name (in my example "gesamt") should always appear as the last row/column, and not like this:

sex    female        gesamt         male
            n      %      n       %    n      %
age

What I need is:

sex    female          male       gesamt
            n      %      n       %    n      %
age

This is the minimal working example:

#!/usr/bin/env python3
import pandas as pd
import pydataset

# sample data
df = pydataset.data('agefat')
df = df.loc[df.age < 35]

# Label of the margin column/row
mn = 'gesamt'

# count absolute
taba = pd.crosstab(df.age, df.sex, margins=True, margins_name=mn)
# percentage / normalized
tabb = pd.crosstab(df.age, df.sex, margins=True, margins_name=mn,
                   normalize=True).round(4)*100

# combine (based on: https://stackoverflow.com/a/68362010/4865723)
tab = pd.concat([taba, tabb], axis=1, keys=['n', '%']).swaplevel(axis=1)

# sort the columns        
tab = tab.sort_index(axis=1, ascending=[True, False])

print(tab)
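
For readers without pydataset, the same situation can be reproduced with a small inline frame. This is only a sketch: the rows below are made up by hand as a stand-in for the filtered 'agefat' sample, and the name top_level is introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the filtered 'agefat' sample (hand-made rows)
df = pd.DataFrame({
    "age": [23, 23, 24, 27, 27, 31],
    "sex": ["female", "male", "male", "male", "male", "female"],
})

mn = "gesamt"  # label of the margin row/column

# absolute counts and percentages, both with margins
taba = pd.crosstab(df.age, df.sex, margins=True, margins_name=mn)
tabb = pd.crosstab(df.age, df.sex, margins=True, margins_name=mn,
                   normalize=True).round(4) * 100

# put 'n' and '%' side by side below each value of sex
tab = pd.concat([taba, tabb], axis=1, keys=["n", "%"]).swaplevel(axis=1)

# plain lexicographic sort of the column labels
tab = tab.sort_index(axis=1, ascending=[True, False])

# unique top-level labels in their current order
top_level = list(dict.fromkeys(tab.columns.get_level_values(0)))
print(top_level)
```

With a purely lexicographic sort the margin label ends up between 'female' and 'male', which is the unwanted order described above.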

I also have a possible solution that works, but I am not sure whether it is a good pandas way. I manipulate the sort key so that every label except margins_name is replaced by the highest possible chr() value; sorting in descending order then makes margins_name always appear at the end.

# workaround
tab = tab.sort_index(axis=1, ascending=[False, False],
                     key=lambda x: x.where(x.isin([mn]), chr(0x10ffff)))

print(tab)  # looks like I expect

The resulting output:

sex    female        male        gesamt
            n      %    n      %      n       %
age
23          1  16.67    1  16.67      2   33.33
24          0   0.00    1  16.67      1   16.67
27          0   0.00    2  33.33      2   33.33
31          1  16.67    0   0.00      1   16.67
gesamt      2  33.33    4  66.67      6  100.00
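
To see what the key function in the workaround does, it can be applied to a plain Index on its own. The sketch below assumes a first column level with the three labels from the example; margin_last_key and HIGHEST are hypothetical names that merely wrap the expression used in the lambda.

```python
import pandas as pd

mn = "gesamt"
HIGHEST = chr(0x10FFFF)  # largest Unicode code point


def margin_last_key(labels: pd.Index) -> pd.Index:
    """Hypothetical helper wrapping the lambda from the workaround.

    Keeps the margin label and replaces every other label by HIGHEST.
    """
    return labels.where(labels.isin([mn]), HIGHEST)


level0 = pd.Index(["female", "gesamt", "male"])
mapped = margin_last_key(level0)
print(list(mapped))

# In descending order the margin label is the smallest value, so it comes last
print(sorted(mapped, reverse=True))
```

Note that all non-margin labels become equal under this key, so the key by itself does not define their order relative to each other.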

Use an ordered CategoricalIndex for custom ordering of the first level of the MultiIndex:

i = tab.columns.levels[0]
out = sorted(i.difference([mn]))
out.append(mn)

new = pd.CategoricalIndex(i, ordered=True, categories=out)
tab.columns = tab.columns.set_levels(new,level=0)

tab = tab.sort_index(axis=1, ascending=[True, False])

print(tab)
sex    female        male        gesamt        
            n      %    n      %      n       %
age                                            
2000        2  33.33    0   0.00      2   33.33
2001        1  16.67    1  16.67      2   33.33
2002        1  16.67    1  16.67      2   33.33
gesamt      4  66.67    2  33.33      6  100.00
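
The reason this works can be shown on a flat index, without any MultiIndex. A minimal sketch, assuming the three labels from the example; the variable names are chosen here for illustration only.

```python
import pandas as pd

mn = "gesamt"
labels = ["gesamt", "male", "female"]

# plain string index: sorted lexicographically
plain = pd.Index(labels).sort_values()

# ordered categorical index: the category list defines the sort order,
# here all other labels alphabetically, then the margin label
order = sorted(set(labels) - {mn}) + [mn]
cat = pd.CategoricalIndex(labels, categories=order, ordered=True).sort_values()

print(list(plain))
print(list(cat))
```

The plain index places 'gesamt' between 'female' and 'male', while the categorical one follows the category list and puts it last.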

I would just select the total columns using a list comprehension and piece together the column selection as desired:

cols_tot = [c for c in tab.columns if c[0] == mn]
print(tab[[c for c in tab.columns if c not in cols_tot] + cols_tot])

sex    female        male        gesamt        
            n      %    n      %      n       %
age                                            
23          1  16.67    1  16.67      2   33.33
24          0   0.00    1  16.67      1   16.67
27          0   0.00    2  33.33      2   33.33
31          1  16.67    0   0.00      1   16.67
gesamt      2  33.33    4  66.67      6  100.00
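
The same selection can be wrapped in a small function so it is reusable for other tables. This is a sketch only: margin_last is a hypothetical helper name, and the one-row frame is made up to keep the example self-contained.

```python
import pandas as pd


def margin_last(frame: pd.DataFrame, margin: str) -> pd.DataFrame:
    """Hypothetical helper: move columns whose top-level label is `margin` to the end."""
    margin_cols = [c for c in frame.columns if c[0] == margin]
    other_cols = [c for c in frame.columns if c[0] != margin]
    return frame[other_cols + margin_cols]


# made-up frame with the lexicographic column order from the question
cols = pd.MultiIndex.from_product(
    [["female", "gesamt", "male"], ["n", "%"]], names=["sex", None]
)
demo = pd.DataFrame([[1, 16.67, 2, 33.33, 1, 16.67]], index=[23], columns=cols)

result = margin_last(demo, "gesamt")
print(result.columns.tolist())
```

Because the columns are only reselected, the relative order of all non-margin columns stays exactly as it was before the call.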

Let me highlight a detail in addition to @jezrael's original answer.

We already know from @jezrael's answer that .sort_index() takes the ordering of categories into account. This has consequences when you call crosstab() on a column that is an ordered categorical and add margins= (e.g. a total column) to the crosstab.

Going back to the MWE of my question: let's assume that age is not a number but an ordered category.

['younger then 20' < '20 till 60' < 'older then 60']
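
One possible way to obtain such an ordered category from numeric ages is pd.cut(). This is a sketch under assumptions: the bin edges and the sample ages are made up, and right=False is chosen here so that each bin includes its left edge.

```python
import pandas as pd

labels = ["younger then 20", "20 till 60", "older then 60"]
ages = pd.Series([15, 35, 70, 20])

# assumed bin edges: [0, 20), [20, 60), [60, 150)
age_cat = pd.cut(ages, bins=[0, 20, 60, 150], right=False, labels=labels)

print(age_cat.cat.ordered)
print(list(age_cat))
```

The resulting Series is an ordered categorical whose categories follow the order of the labels list.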

If you do it as in the original MWE, the column loses its categorical order and .sort_index() sorts it only lexicographically:

# Label of the margin column/row
mn = 'gesamt'
# count absolute
taba = pd.crosstab(df.age, df.sex, margins=True, margins_name=mn)

What you have to do is add the margin label (the margins_name value) as one of the categories before calling crosstab().

df.age = df.age.cat.add_categories([mn])  # mn=='gesamt'
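
The effect of add_categories() itself can be checked in isolation. A minimal sketch with a hand-built ordered categorical Series (the values are made up); it only shows that the new label is appended as the last category and that the ordering is kept, not how crosstab() behaves afterwards.

```python
import pandas as pd

mn = "gesamt"

# hand-built ordered categorical, values deliberately not in category order
age = pd.Series(pd.Categorical(
    ["20 till 60", "older then 60", "younger then 20"],
    categories=["younger then 20", "20 till 60", "older then 60"],
    ordered=True,
))

# append the margin label as an additional (last) category
age = age.cat.add_categories([mn])

print(list(age.cat.categories))
print(age.cat.ordered)
print(list(age.sort_values()))
```

Since the margin label becomes the highest category, anything that sorts by category order places it after all age groups.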
