
Reshaping Pandas DataFrame with Repeated Column Index

Suppose I have the following DataFrame:

>>> import pandas as pd
>>> cols = ['model', 'parameter', 'condition', 'value']
>>> df = pd.DataFrame([['BMW', '0-60', 'rain', '7'], ['BMW', '0-60', 'sun', '7'],
...                    ['BMW', 'mpg', 'rain', '25'],
...                    ['BMW', 'stars', 'rain', '5'],
...                    ['Toyota', '0-60', 'rain', '9'],
...                    ['Toyota', 'mpg', 'rain', '40'],
...                    ['Toyota', 'stars', 'rain', '4']], columns=cols)

>>> df
    model parameter condition value
0     BMW      0-60      rain     7
1     BMW      0-60       sun     7
2     BMW       mpg      rain    25
3     BMW     stars      rain     5
4  Toyota      0-60      rain     9
5  Toyota       mpg      rain    40
6  Toyota     stars      rain     4

This is a list of performance metrics for various cars under different conditions. This is a made-up data set, of course, but it's representative of my problem.

What I ultimately want is to have each observation for a given condition on its own row, and each metric in its own column. This would look something like this:

    parameter  condition  0-60   mpg    stars
     model        
0     BMW       rain       7      25     5
1     BMW       sun        7      NaN    NaN
2     Toyota    rain       9      40     4

Note that I just made up the format above. I don't know if Pandas would generate something exactly like that, but that's the general idea. I would also, of course, transform the "condition" column into a Boolean array and fill in the NaNs.

My problem is that when I try to use the pivot method I get an error. I think this is because my "column" key is repeated (I have BMW 0-60 stats for both the rain and the sun conditions).

df.pivot(index='model',columns='parameter')
ValueError: Index contains duplicate entries, cannot reshape
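
The diagnosis above can be checked by looking for (model, parameter) pairs that occur more than once. This is only a minimal sketch, assuming the DataFrame from the question; the name `dupes` is illustrative and not part of the original post.

```python
import pandas as pd

cols = ['model', 'parameter', 'condition', 'value']
df = pd.DataFrame([['BMW', '0-60', 'rain', '7'], ['BMW', '0-60', 'sun', '7'],
                   ['BMW', 'mpg', 'rain', '25'],
                   ['BMW', 'stars', 'rain', '5'],
                   ['Toyota', '0-60', 'rain', '9'],
                   ['Toyota', 'mpg', 'rain', '40'],
                   ['Toyota', 'stars', 'rain', '4']], columns=cols)

# keep=False flags every row whose (model, parameter) pair appears more than once;
# these are the rows that pivot cannot place into a single cell.
dupes = df[df.duplicated(subset=['model', 'parameter'], keep=False)]
print(dupes)

# Including 'condition' in the key makes each row unique,
# so a reshape keyed on all three columns has no collisions.
has_collisions = df.duplicated(subset=['model', 'parameter', 'condition']).any()
print(has_collisions)
```

Only the two BMW 0-60 rows are flagged, which suggests that "condition" has to be part of the row key for the reshape to work.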

Does anyone know of a slick way to do this? I'm finding a lot of these Pandas reshaping methods to be quite obtuse.

You can just change the index and unstack it:

df.set_index(['model', 'condition', 'parameter']).unstack()

returns

                 value           
parameter         0-60  mpg stars
model  condition                 
BMW    rain          7   25     5
       sun           7  NaN   NaN
Toyota rain          9   40     4
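
If the extra "value" level in the column header is unwanted, one possible variation is to select the "value" column before unstacking and then reset the index. This is a sketch under the assumption that a flat table closer to the layout in the question is the goal; the name `wide` is illustrative.

```python
import pandas as pd

cols = ['model', 'parameter', 'condition', 'value']
df = pd.DataFrame([['BMW', '0-60', 'rain', '7'], ['BMW', '0-60', 'sun', '7'],
                   ['BMW', 'mpg', 'rain', '25'],
                   ['BMW', 'stars', 'rain', '5'],
                   ['Toyota', '0-60', 'rain', '9'],
                   ['Toyota', 'mpg', 'rain', '40'],
                   ['Toyota', 'stars', 'rain', '4']], columns=cols)

wide = (df.set_index(['model', 'condition', 'parameter'])['value']  # Series, so no extra column level
          .unstack()        # moves the innermost index level ('parameter') into the columns
          .reset_index())   # turns 'model' and 'condition' back into ordinary columns
wide.columns.name = None    # drop the leftover 'parameter' label on the column axis
print(wide)
```

Note that the values are still strings here, because the question builds the DataFrame from string literals.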

You can get the result you want using pivot_table and passing the following parameters:

>>> df.pivot_table(index=['model', 'condition'], values='value', columns='parameter')
parameter         0-60  mpg  stars
model  condition                  
BMW    rain          7   25      5
       sun           7  NaN    NaN
Toyota rain          9   40      4

(You may need to ensure the "value" column has numeric types first or else you can pass aggfunc=lambda x: x in the pivot_table function to get around this requirement.)
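
The numeric-type caveat can be handled roughly as follows. This is a sketch rather than the answerer's exact code: it converts "value" with `pd.to_numeric` before pivoting, and, as a substitute for the lambda mentioned above, shows `aggfunc='first'` as one way to keep the original strings. The names `num`, `table` and `table_str` are illustrative.

```python
import pandas as pd

cols = ['model', 'parameter', 'condition', 'value']
df = pd.DataFrame([['BMW', '0-60', 'rain', '7'], ['BMW', '0-60', 'sun', '7'],
                   ['BMW', 'mpg', 'rain', '25'],
                   ['BMW', 'stars', 'rain', '5'],
                   ['Toyota', '0-60', 'rain', '9'],
                   ['Toyota', 'mpg', 'rain', '40'],
                   ['Toyota', 'stars', 'rain', '4']], columns=cols)

# Option 1: convert the strings to numbers so the default aggregation (mean) applies.
num = df.assign(value=pd.to_numeric(df['value']))
table = num.pivot_table(index=['model', 'condition'],
                        columns='parameter', values='value')
print(table)

# Option 2 (assumption, swapped in for the lambda): keep the strings and take
# the first value found for each (model, condition, parameter) combination.
table_str = df.pivot_table(index=['model', 'condition'],
                           columns='parameter', values='value',
                           aggfunc='first')
print(table_str)
```

With the default aggregation the cells come back as floats, since each one is the mean of a single value.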
