How to create hierarchical columns in pandas?
I have a pandas dataframe that looks like this:
        rank_2015  num_2015  rank_2014  num_2014  ....  num_2008
France          8      1200          9      1216  ....      1171
Italy          11       789          6       788  ....       654
Now I want to draw a bar chart of the sums of just the num_ columns, by year. So on the x-axis I would like the years from 2008 to 2015, and on the y-axis I would like the sum of the related num_ column.
What's the best way to do this? I know how to get the sums for each column:
df.sum()
But what I don't know is how to chart only the num_ columns, and also how to re-label those columns so that the labels are integers rather than strings, in order to get them to chart correctly.
I'm wondering if I want to create hierarchical columns, like this:
        rank        num
        2015  2014  2015  2014  ....  2008
France     8     9  1200  1216  ....  1171
Italy     11     6   789   788  ....   654
Then I could just chart the columns in the num section.
How can I get my dataframe into this shape?
You could use str.extract with the regex pattern (.+)_(\d+) to convert the columns to a DataFrame:
cols = df.columns.str.extract(r'(.+)_(\d+)', expand=True)
#       0     1
# 0   num  2008
# 1   num  2014
# 2   num  2015
# 3  rank  2014
# 4  rank  2015
You can then build a hierarchical (MultiIndex) index from cols and reassign it to df.columns:
df.columns = pd.MultiIndex.from_arrays((cols[0], cols[1]))
so that df becomes
         num              rank
        2008  2014  2015  2014  2015
France  1171  1216  1200     9     8
Italy    654   788   789     6    11
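import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'num_2008': [1171, 654],
                   'num_2014': [1216, 788],
                   'num_2015': [1200, 789],
                   'rank_2014': [9, 6],
                   'rank_2015': [8, 11]}, index=['France', 'Italy'])

# Split each column label into a prefix (column 0) and a year (column 1)
cols = df.columns.str.extract(r'(.+)_(\d+)', expand=True)
# The extracted years are strings; convert them to integers
cols[1] = pd.to_numeric(cols[1])
df.columns = pd.MultiIndex.from_arrays((cols[0], cols[1]))
# Clear the level names so they do not show up as axis labels
df.columns.names = [None]*2
# Sum each num column by year and draw the bar chart
df['num'].sum().plot(kind='bar')
plt.show()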
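As a possible alternative to the regex, the labels can also be split on the underscore directly. This is only a minimal sketch, assuming every column name contains exactly one underscore between the prefix and the year, as in the sample data above; the names year_level and totals are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'num_2008': [1171, 654],
                   'num_2014': [1216, 788],
                   'num_2015': [1200, 789],
                   'rank_2014': [9, 6],
                   'rank_2015': [8, 11]}, index=['France', 'Italy'])

# With expand=True, splitting the flat labels yields a two-level MultiIndex
df.columns = df.columns.str.split('_', expand=True)

# The year level still holds strings, so convert it to integers
year_level = df.columns.levels[1].astype(int)
df.columns = df.columns.set_levels(year_level, level=1)

# Select the 'num' block and sum each year's column
totals = df['num'].sum()
print(totals)
```

Calling totals.plot(kind='bar') followed by plt.show() should then give the same kind of chart as above.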
You probably don't need to reshape your dataset; this can be achieved more easily by working with the num_ data only.
Dummy data: (figure not preserved)
Code:
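# Keep only the columns whose name starts with 'num_'
df_num = df[[c for c in df.columns if c.startswith('num_')]].copy()
# Slice off the prefix (lstrip removes a set of characters, not a prefix) and use integer years
df_num.columns = [int(c[len('num_'):]) for c in df_num.columns]
df_num.sum().plot(kind='bar')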
Result: a bar chart with one bar per year, showing the sum of the num_ column for that year.
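The same selection and relabeling could also be sketched with DataFrame.filter and DataFrame.rename. This assumes the columns of interest are all named num_<year>, as in the sample data; year_totals is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'num_2008': [1171, 654],
                   'num_2014': [1216, 788],
                   'num_2015': [1200, 789],
                   'rank_2014': [9, 6],
                   'rank_2015': [8, 11]}, index=['France', 'Italy'])

# Keep only the columns whose label starts with 'num_'
df_num = df.filter(regex=r'^num_')

# Relabel 'num_2008' -> 2008 so the labels are integers
df_num = df_num.rename(columns=lambda c: int(c.split('_', 1)[1]))

# Sum each year's column and order the result by year
year_totals = df_num.sum().sort_index()
print(year_totals)
```

From here, year_totals.plot(kind='bar') would draw the chart with the years on the x-axis.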