Question
I have a dataframe untidy
  attribute value
0       age    49
1       sex     M
2    height   176
3       age    27
4       sex     F
5    height   172
where the values in the 'attribute' column repeat periodically.
The desired output is tidy
   age sex  height
0   49   M     176
1   27   F     172
(The row and column order or additional labels don't matter, I can clean this up myself.)
Code for instantiation:
import pandas as pd
untidy = pd.DataFrame([['age', 49],['sex', 'M'],['height', 176],['age', 27],['sex', 'F'],['height', 172]], columns=['attribute', 'value'])
tidy = pd.DataFrame([[49, 'M', 176], [27, 'F', 172]], columns=['age', 'sex', 'height'])
Attempts
This looks like a simple pivot operation, but my initial approach introduces NaN values:
>>> untidy.pivot(columns='attribute', values='value')
attribute  age height  sex
0           49    NaN  NaN
1          NaN    NaN    M
2          NaN    176  NaN
3           27    NaN  NaN
4          NaN    NaN    F
5          NaN    172  NaN
Some messy attempts to fix this:
>>> untidy.pivot(columns='attribute', values='value').apply(lambda c: c.dropna().reset_index(drop=True))
attribute age height sex
0          49    176   M
1          27    172   F
>>> untidy.set_index([untidy.index//untidy['attribute'].nunique(), 'attribute']).unstack('attribute')
          value
attribute   age height sex
0            49    176   M
1            27    172   F
What's the idiomatic way to do this?
Use pandas.pivot with GroupBy.cumcount for the new index values, and rename_axis to remove the columns name:
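# Current pandas expects the frame as the first argument of pd.pivot,
# so the per-attribute counter is added as a helper column ('row') first.
df = (pd.pivot(untidy.assign(row=untidy.groupby('attribute').cumcount()),
               index='row', columns='attribute', values='value')
        .rename_axis(index=None, columns=None))
print(df)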
  age height sex
0  49    176   M
1  27    172   F
Another solution:
df = (untidy.set_index([untidy.groupby('attribute').cumcount(), 'attribute'])['value']
            .unstack()
            .rename_axis(None, axis=1))
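To see why cumcount is a suitable new index, it can help to look at what it returns on its own. The following is only a minimal sketch that reuses the untidy frame from the question; the names row_id and result are made up for illustration.

```python
import pandas as pd

untidy = pd.DataFrame(
    [['age', 49], ['sex', 'M'], ['height', 176],
     ['age', 27], ['sex', 'F'], ['height', 172]],
    columns=['attribute', 'value'])

# Number each occurrence of an attribute within its group:
# the first 'age' row gets 0, the second 'age' row gets 1, and so on.
row_id = untidy.groupby('attribute').cumcount()
print(row_id.tolist())

# Using that counter as the outer index level gives every record its own row
# once the 'attribute' level is unstacked into columns.
result = (untidy.set_index([row_id, 'attribute'])['value']
                .unstack()
                .rename_axis(None, axis=1))
print(result)
```

Because the counter is computed per attribute, this does not rely on the rows arriving in a fixed order. It does assume that every record contains every attribute; if one is missing, later values of that attribute shift up into the wrong row.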
An alternative approach would be to introduce a new column first with the cumulative count of age:
untidy["index"] = (untidy["attribute"] == "age").cumsum() - 1
Now untidy looks like
  attribute value  index
0       age    49      0
1       sex     M      0
2    height   176      0
3       age    27      1
4       sex     F      1
5    height   172      1
This lets you build a dataframe with a MultiIndex on index and attribute, and then unstack the attribute level into columns:
tidy = untidy.set_index(["index", "attribute"]).unstack()
Which leads to the following format
          value
attribute   age height sex
index
0            49    176   M
1            27    172   F
The only remaining problem is that the columns are now a MultiIndex with one level too many. You can get rid of it by transposing so the columns become the index, dropping that level from the index, and transposing back:
tidy = tidy.T.reset_index(level=0).drop("level_0", axis=1).T
The final result is your tidy data frame
attribute age height sex
index
0          49    176   M
1          27    172   F
You can of course combine the second and third steps into one. I am not sure if this is more idiomatic, but to me it is at least more intuitive.
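As a rough sketch of what combining those steps might look like (not necessarily what the answer's author had in mind), selecting the value column before unstacking avoids the extra column level altogether, and MultiIndex.droplevel removes it afterwards without the double transpose. The names tidy_a and tidy_b are made up for illustration.

```python
import pandas as pd

untidy = pd.DataFrame(
    [['age', 49], ['sex', 'M'], ['height', 176],
     ['age', 27], ['sex', 'F'], ['height', 172]],
    columns=['attribute', 'value'])

# Record counter: increases by one each time a new 'age' row starts a record.
untidy["index"] = (untidy["attribute"] == "age").cumsum() - 1

# Option A: unstack the 'value' Series, so no extra column level is created.
tidy_a = untidy.set_index(["index", "attribute"])["value"].unstack()

# Option B: unstack the whole frame, then drop the outer ('value') column level.
tidy_b = untidy.set_index(["index", "attribute"]).unstack()
tidy_b.columns = tidy_b.columns.droplevel(0)

print(tidy_a)
```

Both variants keep the assumption from the answer above that every record begins with an age row.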
pivot should work fine, but if you are sure your labels repeat every 3rd row, you can use pd.concat with a generator expression:
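# slice the question's frame (untidy) into blocks of 3 rows and turn each block into one row
gen = (untidy.iloc[3*i:3*(i+1)].set_index('attribute').T for i in range(untidy.shape[0] // 3))
df = pd.concat(gen, ignore_index=True)
print(df)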
attribute age sex height
0          49   M    176
1          27   F    172
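If the block length should not be hard-coded, one possible variation derives it from the number of distinct attributes. This is only a sketch under the same assumption as above, namely that every record lists the same attributes in the same order; the names period, chunks and result are made up for illustration.

```python
import pandas as pd

untidy = pd.DataFrame(
    [['age', 49], ['sex', 'M'], ['height', 176],
     ['age', 27], ['sex', 'F'], ['height', 172]],
    columns=['attribute', 'value'])

# Assumption: each record has exactly one row per distinct attribute,
# so the number of distinct attributes equals the block length.
period = untidy['attribute'].nunique()

# Cut the frame into consecutive blocks and transpose each block into one row.
chunks = (untidy.iloc[start:start + period].set_index('attribute').T
          for start in range(0, len(untidy), period))
result = pd.concat(chunks, ignore_index=True)
print(result)
```

Unlike the pivot-based solutions, this keeps the columns in the order in which the attributes first appear.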