
Pivot a two-column dataframe

Question

I have a dataframe untidy

  attribute value
0       age    49
1       sex     M
2    height   176
3       age    27
4       sex     F
5    height   172

where the values in the 'attribute' column repeat periodically. The desired output is tidy

  age sex height
0  49   M    176
1  27   F    172

(The row and column order or additional labels don't matter, I can clean this up myself.)

Code for instantiation:

import pandas as pd

untidy = pd.DataFrame([['age', 49], ['sex', 'M'], ['height', 176], ['age', 27], ['sex', 'F'], ['height', 172]], columns=['attribute', 'value'])
tidy = pd.DataFrame([[49, 'M', 176], [27, 'F', 172]], columns=['age', 'sex', 'height'])

Attempts

This looks like a simple pivot operation, but my initial approach introduces NaN values:

>>> untidy.pivot(columns='attribute', values='value')
attribute  age height  sex
0           49    NaN  NaN
1          NaN    NaN    M
2          NaN    176  NaN
3           27    NaN  NaN
4          NaN    NaN    F
5          NaN    172  NaN
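
The NaN values seem to come from pivot reusing the original row labels when no index is given, so each value ends up alone in its own row. A minimal sketch that checks this on the frame above (the name wide is only illustrative):

```python
import pandas as pd

untidy = pd.DataFrame(
    [['age', 49], ['sex', 'M'], ['height', 176],
     ['age', 27], ['sex', 'F'], ['height', 172]],
    columns=['attribute', 'value'])

# Without index=, pivot reuses the existing row labels 0..5 as the index.
wide = untidy.pivot(columns='attribute', values='value')

print(wide.shape)                         # (6, 3)
# Each of the six rows holds exactly one non-missing value.
print(wide.notna().sum(axis=1).tolist())  # [1, 1, 1, 1, 1, 1]
```

So the task comes down to finding an index under which the three rows of one record share a label.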

Some messy attempts to fix this:

>>> untidy.pivot(columns='attribute', values='value').apply(lambda c: c.dropna().reset_index(drop=True))
attribute age height sex
0          49    176   M
1          27    172   F


>>> untidy.set_index([untidy.index//untidy['attribute'].nunique(), 'attribute']).unstack('attribute')
          value           
attribute   age height sex
0            49    176   M
1            27    172   F

What's the idiomatic way to do this?

Use DataFrame.pivot with GroupBy.cumcount to build the new index values, and rename_axis to remove the axis names:

df = (untidy.assign(row=untidy.groupby('attribute').cumcount())
            .pivot(index='row', columns='attribute', values='value')
            .rename_axis(index=None, columns=None))
print(df)
  age height sex
0  49    176   M
1  27    172   F
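
To see why cumcount gives a usable index, it may help to look at what it returns on its own. The sketch below rebuilds the question's untidy frame and prints the per-attribute occurrence counter; the variable name occurrence is only illustrative.

```python
import pandas as pd

untidy = pd.DataFrame(
    [['age', 49], ['sex', 'M'], ['height', 176],
     ['age', 27], ['sex', 'F'], ['height', 172]],
    columns=['attribute', 'value'])

# cumcount numbers each row within its 'attribute' group: the first 'age'
# gets 0, the second 'age' gets 1, and likewise for 'sex' and 'height'.
occurrence = untidy.groupby('attribute').cumcount()
print(occurrence.tolist())  # [0, 0, 0, 1, 1, 1]
```

Rows with the same counter value belong to the same record, so pivoting on it leaves no gaps. This does not depend on the attributes appearing in a fixed order within a record, but it does assume that every record lists every attribute exactly once.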

Another solution:

df = (untidy.set_index([untidy.groupby('attribute').cumcount(), 'attribute'])['value']
            .unstack()
            .rename_axis(None, axis=1))

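An alternative approach is to first add a column holding a running count of the 'age' rows, so that every row is labelled with the number of the record it belongs to (this assumes each record starts with age):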

untidy["index"] = (untidy["attribute"] == "age").cumsum() - 1

Now untidy looks like

  attribute value  index
0       age    49      0
1       sex     M      0
2    height   176      0
3       age    27      1
4       sex     F      1
5    height   172      1

In this way you can create a multiindex dataframe based on attribute and index like this

tidy = untidy.set_index(["index", "attribute"]).unstack()

Which leads to the following format

          value           
attribute   age height sex
index                     
0            49    176   M
1            27    172   F

The only remaining problem is that the columns are now a MultiIndex with one level too many. You can get rid of it by transposing so that the columns become the index, dropping the extra index level, and transposing back:

tidy = tidy.T.reset_index(level=0).drop("level_0", axis=1).T

The final result is your tidy data frame

attribute age height sex
index                   
0          49    176   M
1          27    172   F

You can of course combine the second and third steps into one. I am not sure this is more idiomatic, but to me it is at least more intuitive.
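
As a possible shortcut for the last step, the extra column level can probably be removed without transposing at all. The sketch below swaps in DataFrame.droplevel (pandas 0.24 or later), which is not what the answer above uses; everything else follows the same steps.

```python
import pandas as pd

untidy = pd.DataFrame(
    [['age', 49], ['sex', 'M'], ['height', 176],
     ['age', 27], ['sex', 'F'], ['height', 172]],
    columns=['attribute', 'value'])

# Record number: increases by one at every 'age' row, as in the answer above.
untidy["index"] = (untidy["attribute"] == "age").cumsum() - 1

# Unstack as before, then drop the outer 'value' level of the columns.
tidy = untidy.set_index(["index", "attribute"]).unstack().droplevel(0, axis=1)
print(tidy.columns.tolist())  # ['age', 'height', 'sex']
```

Selecting the value column before unstacking, as in the set_index(...)['value'].unstack() answer, avoids creating the extra level in the first place.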

pivot should work fine, but if you are sure your labels repeat every 3rd row, you can use pd.concat with a generator expression:

gen = (untidy.iloc[3*i:3*(i+1)].set_index('attribute').T for i in range(untidy.shape[0] // 3))
df = pd.concat(gen, ignore_index=True)

print(df)

attribute age sex height
0          49   M    176
1          27   F    172
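
If the fixed period really holds, the same idea can also be written without a loop by reshaping the value column. This is a different technique from the concat approach above, sketched under the same assumption that every record lists the same attributes in the same order; the names n_attrs and tidy are illustrative.

```python
import pandas as pd

untidy = pd.DataFrame(
    [['age', 49], ['sex', 'M'], ['height', 176],
     ['age', 27], ['sex', 'F'], ['height', 172]],
    columns=['attribute', 'value'])

# Number of distinct attributes, i.e. the assumed length of one record.
n_attrs = untidy['attribute'].nunique()

# Reshape the flat value column into one row per record and take the
# column names from the first record.
tidy = pd.DataFrame(
    untidy['value'].to_numpy().reshape(-1, n_attrs),
    columns=untidy['attribute'].iloc[:n_attrs].tolist())
print(tidy)
```

Unlike the cumcount-based answers, this silently produces wrong rows if any record is missing an attribute or lists them in a different order, so it is only worth using when the layout is guaranteed.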
