
Pandas: create columns with tuples as labels from unique pairs of row values

Imagine a df like this:

timestamp data_point_1 data_point_2 some_data
2021/06/24 a b 2
2021/06/24 c d 3
2021/06/25 c d 3

I want to reshape it so that each unique (data_point_1, data_point_2) pair becomes a column labeled with that tuple, and each cell holds the some_data value for that timestamp:

timestamp (a,b) (c,d)
2021/06/24 2 3
2021/06/25 NaN 3

Here's the example data snippet:

import pandas as pd

test = pd.DataFrame({'timestamp': ["2021/06/24", "2021/06/24", "2021/06/25"], 'data_point_1': ["a", "c", "c"], 'data_point_2': ["b", "d", "d"], 'some_data': [2, 3, 3]})

print(test)
#    timestamp data_point_1 data_point_2  some_data
# 0  2021/06/24            a            b          2
# 1  2021/06/24            c            d          3
# 2  2021/06/25            c            d          3

# desired:
#    timestamp   (a,b)       (c,d)
# 0  2021/06/24    2           3
# 1  2021/06/25    0           3

Thanks :)

Use DataFrame.pivot, then convert the resulting MultiIndex column labels to tuples:

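# keyword arguments are required by DataFrame.pivot in pandas 2.0+
df = test.pivot(index='timestamp', columns=['data_point_1', 'data_point_2'], values='some_data')
# flatten the MultiIndex columns into plain tuple labels
df.columns = [tuple(x) for x in df.columns]
df = df.reset_index()
print(df)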
    timestamp  (a, b)  (c, d)
0  2021/06/24     2.0     3.0
1  2021/06/25     NaN     3.0
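
The "desired" comment in the question shows 0 instead of NaN for the missing (a, b) value on 2021/06/25. If that is what you want, here is a minimal sketch that fills the gaps after pivoting. The names wide and filled are just illustrative, and casting back to int assumes that 0 is an acceptable stand-in for "no data":

```python
import pandas as pd

test = pd.DataFrame({
    'timestamp': ["2021/06/24", "2021/06/24", "2021/06/25"],
    'data_point_1': ["a", "c", "c"],
    'data_point_2': ["b", "d", "d"],
    'some_data': [2, 3, 3],
})

# Same pivot as above, kept in a separate variable
wide = test.pivot(index='timestamp',
                  columns=['data_point_1', 'data_point_2'],
                  values='some_data')
wide.columns = [tuple(x) for x in wide.columns]

# Replace missing pairs with 0 (assumption: 0 means "no data"),
# then cast back to int because NaN forced the columns to float
filled = wide.fillna(0).astype(int).reset_index()
print(filled)
```

Keep the NaN version if a real 0 in some_data must stay distinguishable from a missing pair.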

If there are duplicate rows for the same timestamp, data_point_1, data_point_2 combination, the values have to be aggregated. In that case use DataFrame.pivot_table with an aggregate function such as mean:

# if values need to be aggregated
df = test.pivot_table(index='timestamp',
                      columns=['data_point_1', 'data_point_2'],
                      values='some_data',
                      aggfunc='mean')
df.columns = [tuple(x) for x in df.columns]
df = df.reset_index()
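
To illustrate the duplicate case, here is a small sketch with made-up data (dup is hypothetical, not from the question) in which two rows share the same timestamp and the same (a, b) pair. Plain pivot cannot reshape this and raises a ValueError, while pivot_table averages the duplicates:

```python
import pandas as pd

# Hypothetical data: the first two rows share timestamp and (a, b)
dup = pd.DataFrame({
    'timestamp': ["2021/06/24", "2021/06/24", "2021/06/25"],
    'data_point_1': ["a", "a", "c"],
    'data_point_2': ["b", "b", "d"],
    'some_data': [2, 4, 3],
})

# pivot refuses to reshape when an index/column combination repeats
try:
    dup.pivot(index='timestamp',
              columns=['data_point_1', 'data_point_2'],
              values='some_data')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates the duplicates instead (mean of 2 and 4)
agg = dup.pivot_table(index='timestamp',
                      columns=['data_point_1', 'data_point_2'],
                      values='some_data',
                      aggfunc='mean')
agg.columns = [tuple(x) for x in agg.columns]
agg = agg.reset_index()
print(agg)
```

Pick the aggfunc that matches what a duplicate means in your data, for example 'sum' or 'first' instead of 'mean'.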

