Manipulate ordering/sorting of Multirow columns in a pandas DataFrame
This is a side problem caused by an answer from another question.
I combine two crosstab() results, one with counts and one with normalized values. The problem is that the resulting column names are not in the right order. "Right" means that the margins_name (in my example "gesamt") should always appear as the last row/column, and not like this:
sex female gesamt male
n % n % n %
age
What I need is:
sex female male gesamt
n % n % n %
age
This is the minimal working example:
#!/usr/bin/env python3
import pandas as pd
import pydataset
# sample data
df = pydataset.data('agefat')
df = df.loc[df.age < 35]
# Label of the margin column/row
mn = 'gesamt'
# count absolute
taba = pd.crosstab(df.age, df.sex, margins=True, margins_name=mn)
# percentage / normalized
tabb = pd.crosstab(df.age, df.sex, margins=True, margins_name=mn,
normalize=True).round(4)*100
# combine (based on: https://stackoverflow.com/a/68362010/4865723)
tab = pd.concat([taba, tabb], axis=1, keys=['n', '%']).swaplevel(axis=1)
# sort the columns
tab = tab.sort_index(axis=1, ascending=[True, False])
print(tab)
I also have a possible solution that works, but I am not sure whether it is a good pandas way. I manipulate the sort key so that margins_name always ends up last: every label other than margins_name is replaced with the highest possible chr() value, and sorting in descending order then puts margins_name at the end.
# workaround
tab = tab.sort_index(axis=1, ascending=[False, False],
key=lambda x: x.where(x.isin([mn]), chr(0x10ffff)))
print(tab) # looks like I expect
The resulting output:
sex female male gesamt
n % n % n %
age
23 1 16.67 1 16.67 2 33.33
24 0 0.00 1 16.67 1 16.67
27 0 0.00 2 33.33 2 33.33
31 1 16.67 0 0.00 1 16.67
gesamt 2 33.33 4 66.67 6 100.00
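A minimal sketch of the same idea with the mapping applied to the margin label itself, so that an ascending sort on the first level places it last. The one-row tab built here is a hypothetical stand-in for the combined crosstab, and margin_last is an illustrative helper name, not part of pandas.

```python
import pandas as pd

mn = "gesamt"

# Hypothetical stand-in for the combined crosstab (one row only).
columns = pd.MultiIndex.from_product([["female", "male", mn], ["%", "n"]])
tab = pd.DataFrame([[16.67, 1, 16.67, 1, 33.33, 2]], index=[23], columns=columns)


def margin_last(labels):
    # Sort key only: swap the margin label for the largest code point,
    # leave every other label unchanged. The real labels are not modified.
    return labels.where(~labels.isin([mn]), chr(0x10FFFF))


# First level ascending (margin last), second level descending ('n' before '%').
sorted_tab = tab.sort_index(axis=1, ascending=[True, False], key=margin_last)
print(sorted_tab.columns.tolist())
```

Because the key function is applied to each level separately, the second level passes through unchanged and is ordered only by the ascending flag given for it.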
Use an ordered CategoricalIndex for custom ordering of the first level of the MultiIndex:
i = tab.columns.levels[0]
out = sorted(i.difference([mn]))
out.append(mn)
new = pd.CategoricalIndex(i, ordered=True, categories=out)
tab.columns = tab.columns.set_levels(new,level=0)
tab = tab.sort_index(axis=1, ascending=[True, False])
print(tab)
sex female male gesamt
n % n % n %
age
2000 2 33.33 0 0.00 2 33.33
2001 1 16.67 1 16.67 2 33.33
2002 1 16.67 1 16.67 2 33.33
gesamt 4 66.67 2 33.33 6 100.00
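To see why this works, here is a small sketch on the column labels alone, independent of the crosstab. The labels index is a hypothetical stand-in for the first level of tab.columns.

```python
import pandas as pd

mn = "gesamt"

# Hypothetical stand-in for the first level of the column MultiIndex.
labels = pd.Index(["female", "gesamt", "male"])

# Plain string labels sort lexicographically, so the margin lands in the middle.
print(list(labels.sort_values()))  # ['female', 'gesamt', 'male']

# Desired order: all other labels sorted, margin label appended last.
order = sorted(labels.difference([mn])) + [mn]

# An ordered categorical sorts by category position instead of by string value.
cat_labels = pd.CategoricalIndex(labels, categories=order, ordered=True)
print(list(cat_labels.sort_values()))
```

The second print shows the margin label at the end, which is the order sort_index() then follows for the first column level.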
I would just select the total columns using a list comprehension and piece together the column selection as desired:
cols_tot = [c for c in tab.columns if c[0] == mn]
print(tab[[c for c in tab.columns if c not in cols_tot] + cols_tot])
sex female male gesamt
n % n % n %
age
23 1 16.67 1 16.67 2 33.33
24 0 0.00 1 16.67 1 16.67
27 0 0.00 2 33.33 2 33.33
31 1 16.67 0 0.00 1 16.67
gesamt 2 33.33 4 66.67 6 100.00
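The same selection can be wrapped in a small helper. This is only a sketch: move_margin_last is a hypothetical name, and the one-row tab is made-up data standing in for the sorted crosstab.

```python
import pandas as pd


def move_margin_last(tab, margin_label):
    """Return tab with the columns of the margin label moved to the end."""
    margin_cols = [c for c in tab.columns if c[0] == margin_label]
    other_cols = [c for c in tab.columns if c[0] != margin_label]
    # Selecting with a list of column tuples keeps exactly that order.
    return tab[other_cols + margin_cols]


# Hypothetical stand-in for the lexicographically sorted crosstab.
columns = pd.MultiIndex.from_product(
    [["female", "gesamt", "male"], ["n", "%"]], names=["sex", None]
)
tab = pd.DataFrame(
    [[1, 16.67, 2, 33.33, 1, 16.67]],
    index=pd.Index([23], name="age"),
    columns=columns,
)

result = move_margin_last(tab, "gesamt")
print(result.columns.tolist())
```

This approach does not sort at all; it keeps whatever order the non-margin columns already have and only moves the margin columns.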
Please let me highlight a detail in addition to @jezrael's original answer.
We know from @jezrael's answer that .sort_index() takes the ordering of categories into account. This has consequences when you call crosstab() on a column that is an ordered categorical and add a margins= column (e.g. a total column) to the crosstab.
Going back to the MWE of my question, let's assume that age is not a number but an ordered category:
['younger then 20' < '20 till 60' < 'older then 60']
The column will lose its categorical order, and .sort_index() will sort it only lexicographically, when you do it like this (as in the original MWE):
# Label of the margin column/row
mn = 'gesamt'
# count absolute
taba = pd.crosstab(df.age, df.sex, margins=True, margins_name=mn)
What you have to do is add the margins_name label as one of the categories before calling pd.crosstab():
df.age = df.age.cat.add_categories([mn]) # mn=='gesamt'
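A short sketch of what add_categories() does to the category order, shown on a standalone Series rather than on the crosstab itself. The age values are made up for illustration; the category labels are the ones from the example above.

```python
import pandas as pd

mn = "gesamt"
levels = ["younger then 20", "20 till 60", "older then 60"]

# Hypothetical ages, already binned into ordered categories.
age = pd.Series(
    pd.Categorical(
        ["20 till 60", "younger then 20", "older then 60", "20 till 60"],
        categories=levels,
        ordered=True,
    ),
    name="age",
)

# The new category is appended after the existing ones; the order is kept.
age = age.cat.add_categories([mn])
print(list(age.cat.categories))
print(age.cat.ordered)

# Labels carrying this dtype, margin included, sort by category position.
labels = pd.CategoricalIndex(
    [mn, "older then 60", "20 till 60", "younger then 20"], dtype=age.dtype
)
print(list(labels.sort_values()))
```

Since the margin label becomes the highest category, any sort that respects the category order places it last.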