I have a dataframe like this:
df1
sample x data    data y
b                a
d                c
f                e
h                g
j                i
l                k
I need to create a new dataframe like this:
information identifier
b x
d x
f x
h x
j x
l x
a y
c y
e y
g y
i y
k y
Can this be done in pandas? It's like stacking one column on top of another but keeping a record of what type of information the column is. Many thanks.
Use str.split on the column names, reshape with DataFrame.unstack, and finish with some cleanup using DataFrame.reset_index:
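# solution for the original data, where the columns are named x_data and y_data
# split 'x_data' into ('x', 'data') so the identifier becomes its own column level
df.columns = df.columns.str.split('_', expand=True)
df = (df.unstack()
        .reset_index(level=[1,2], drop=True)   # keep only the x/y level
        .rename_axis('identifier')
        .reset_index(name='data')[['data','identifier']])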
print (df)
data identifier
0 b x
1 d x
2 f x
3 h x
4 j x
5 l x
6 a y
7 c y
8 e y
9 g y
10 i y
11 k y
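For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. It assumes the original column names x_data and y_data; the sample frame is rebuilt by hand from the question, and the names stacked and information are only illustrative (information matches the header the question asked for).

```python
import pandas as pd

# Sample frame rebuilt from the question; column names x_data / y_data are assumed
df = pd.DataFrame({
    "x_data": ["b", "d", "f", "h", "j", "l"],
    "y_data": ["a", "c", "e", "g", "i", "k"],
})

# "x_data" -> ("x", "data"): the identifier becomes the first column level
df.columns = df.columns.str.split("_", expand=True)

stacked = (
    df.unstack()                              # Series indexed by (identifier, "data", row)
      .reset_index(level=[1, 2], drop=True)   # drop everything except the identifier
      .rename_axis("identifier")
      .reset_index(name="information")[["information", "identifier"]]
)
print(stacked)
```

Working on a separate variable (stacked) rather than overwriting df keeps the original frame available if you want to compare approaches.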
EDIT:
If you use melt, the column names become the values of a new column:
df = df.melt(var_name='identifier', value_name='information')
print (df)
identifier information
0 sample x data b
1 sample x data d
2 sample x data f
3 sample x data h
4 sample x data j
5 sample x data l
6 data y a
7 data y c
8 data y e
9 data y g
10 data y i
11 data y k
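So you can first extract the values x and y from the column names and then melt (starting again from the original DataFrame):
# keep only the first 'x' or 'y' found in each column name
df.columns = df.columns.str.extract('(x|y)', expand=False)
df = df.melt(var_name='identifier', value_name='information')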
print (df)
identifier information
0 x b
1 x d
2 x f
3 x h
4 x j
5 x l
6 y a
7 y c
8 y e
9 y g
10 y i
11 y k
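One caveat with the regex: '(x|y)' picks up the first x or y anywhere in a column name, so a name that happens to contain one of those letters elsewhere would be mislabelled. A hedged alternative, assuming you know the column names in advance, is an explicit mapping passed to rename. The labels dictionary and the result variable below are illustrative names, not part of the answer above.

```python
import pandas as pd

# Sample frame rebuilt from the question, using the column names shown there
df = pd.DataFrame({
    "sample x data": ["b", "d", "f", "h", "j", "l"],
    "data y": ["a", "c", "e", "g", "i", "k"],
})

# Hypothetical explicit mapping from column name to identifier
labels = {"sample x data": "x", "data y": "y"}

result = (
    df.rename(columns=labels)
      .melt(var_name="identifier", value_name="information")
      [["information", "identifier"]]   # column order requested in the question
)
print(result)
```

This leaves df itself untouched, so it can be run before or after the other snippets.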
I think this approach is quite intuitive:
1) Take each column and build a new dataframe that holds its values in 'data' and a constant label in 'identifier' ('x' for x_data, 'y' for y_data):
dx = pd.DataFrame(list(zip(df['x_data'].tolist(), ['x'] * len(df))), columns=['data', 'identifier'])
dy = pd.DataFrame(list(zip(df['y_data'].tolist(), ['y'] * len(df))), columns=['data', 'identifier'])
Consider this piece of code:
list(zip(df['x_data'].tolist(), ['x'] * len(df)))
Here we make two lists: the first holds the values of x_data, and the second holds 'x' repeated once for every element of x_data. zip pairs them up into a single list of (value, 'x') tuples, and pd.DataFrame(list_of_tuples, columns=[...]) turns that list into the dataframe dx.
2) Concatenate the dataframes to deliver a single one with the expected format
df = pd.concat([dx,dy])
print(df)
data identifier
0 b x
1 d x
2 f x
3 h x
4 j x
5 l x
0 a y
1 c y
2 e y
3 g y
4 i y
5 k y
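Note that the index restarts at 0 for the y rows, because pd.concat keeps each frame's own index by default. A shorter sketch of the same idea, assuming the columns are named x_data and y_data, lets pandas broadcast the label and passes ignore_index=True to get a clean 0 to 11 index. The names parts and result are illustrative.

```python
import pandas as pd

# Sample frame rebuilt from the question; column names x_data / y_data are assumed
df = pd.DataFrame({
    "x_data": ["b", "d", "f", "h", "j", "l"],
    "y_data": ["a", "c", "e", "g", "i", "k"],
})

# One small frame per source column; the scalar label is broadcast to every row
parts = [
    pd.DataFrame({"data": df[col], "identifier": label})
    for col, label in [("x_data", "x"), ("y_data", "y")]
]

# ignore_index=True renumbers the rows instead of repeating 0..5 twice
result = pd.concat(parts, ignore_index=True)
print(result)
```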