
pandas merge dataframe and pivot creating new columns

I have two input dataframes:

df1 (note that this DataFrame may contain more data columns)

   Sample Animal  Time     Sex
0       1      A   one    male
1       2      A   two    male
2       3      B   one  female
3       4      C   one    male
4       5      D   one  female

df2

          a    b    c
Sample               
1       0.2  0.4  0.3
2       0.5  0.7  0.2
3       0.4  0.1  0.9
4       0.4  0.2  0.3
5       0.6  0.2  0.4
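For anyone who wants to reproduce this, the two frames printed above could be built roughly as follows. This is only a sketch: the dtypes and the construction method are assumptions, since the question shows just the printed output.

```python
import pandas as pd

# df1: one row per sample, with the default integer index as printed above
df1 = pd.DataFrame({
    "Sample": [1, 2, 3, 4, 5],
    "Animal": ["A", "A", "B", "C", "D"],
    "Time": ["one", "two", "one", "one", "one"],
    "Sex": ["male", "male", "female", "male", "female"],
})

# df2: measurements a, b, c, indexed by Sample
df2 = pd.DataFrame(
    {
        "a": [0.2, 0.5, 0.4, 0.4, 0.6],
        "b": [0.4, 0.7, 0.1, 0.2, 0.2],
        "c": [0.3, 0.2, 0.9, 0.3, 0.4],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Sample"),
)
```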

I would like to combine them so that I get the following:

        one_a  one_b  one_c  two_a  two_b  two_c     Sex
Animal                                                  
A         0.2    0.4    0.3    0.5    0.7    0.2    male
B         0.4    0.1    0.9    NaN    NaN    NaN  female
C         0.4    0.2    0.3    NaN    NaN    NaN    male
D         0.6    0.2    0.4    NaN    NaN    NaN  female

This is how I currently do it:

import pandas as pd

cols = df2.columns  # value columns (a, b, c), captured before the index is reset
df2.reset_index(inplace=True)
df3 = pd.melt(df2, id_vars=['Sample'], value_vars=list(cols))
df4 = pd.merge(df3, df1, on='Sample')
# build labels such as 'one_a' from the Time column and the melted column name
df4['moo'] = df4['Time'] + '_' + df4['variable']
df5 = pd.pivot_table(df4, values='value', index='Animal', columns='moo')
df6 = df1.groupby('Animal').agg('first')
pd.concat([df5, df6], axis=1).drop(columns=['Sample', 'Time'])

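This works fine, but it can be slow for large datasets. I wonder whether any pandas expert sees a better (read: faster, more efficient) way? I am new to pandas and can imagine there are shortcuts here that I do not know about.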

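There are a couple of steps here. The key point is that, in order to create columns like one_a one_b .... two_c, we need to append the Time column to the Sample index to build a multi-level index, and then unstack to get the required form. After that, a groupby on the Animal index level is needed to aggregate the rows and reduce the number of NaNs. The rest is just some manipulation of the format.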

import pandas as pd

# your data
# ==============================
# set index
df1 = df1.set_index('Sample')

print(df1)

       Animal Time     Sex
Sample                    
1           A  one    male
2           A  two    male
3           B  one  female
4           C  one    male
5           D  one  female

print(df2)


          a    b    c
Sample               
1       0.2  0.4  0.3
2       0.5  0.7  0.2
3       0.4  0.1  0.9
4       0.4  0.2  0.3
5       0.6  0.2  0.4



# processing
# =============================
df = df1.join(df2)

df_temp = df.set_index(['Animal', 'Sex','Time'], append=True).unstack()

print(df_temp)


                        a         b         c     
Time                  one  two  one  two  one  two
Sample Animal Sex                                 
1      A      male    0.2  NaN  0.4  NaN  0.3  NaN
2      A      male    NaN  0.5  NaN  0.7  NaN  0.2
3      B      female  0.4  NaN  0.1  NaN  0.9  NaN
4      C      male    0.4  NaN  0.2  NaN  0.3  NaN
5      D      female  0.6  NaN  0.2  NaN  0.4  NaN

# rename the columns if you wish
df_temp.columns = ['{}_{}'.format(time, col) for col, time in df_temp.columns]

print(df_temp)

                      one_a  two_a  one_b  two_b  one_c  two_c
Sample Animal Sex                                             
1      A      male      0.2    NaN    0.4    NaN    0.3    NaN
2      A      male      NaN    0.5    NaN    0.7    NaN    0.2
3      B      female    0.4    NaN    0.1    NaN    0.9    NaN
4      C      male      0.4    NaN    0.2    NaN    0.3    NaN
5      D      female    0.6    NaN    0.2    NaN    0.4    NaN


# 'max' skips NaN, so each animal keeps its one non-null value per column
result = df_temp.reset_index('Sex').groupby(level='Animal').agg('max').sort_index(axis=1)

print(result)

           Sex  one_a  one_b  one_c  two_a  two_b  two_c
Animal                                                  
A         male    0.2    0.4    0.3    0.5    0.7    0.2
B       female    0.4    0.1    0.9    NaN    NaN    NaN
C         male    0.4    0.2    0.3    NaN    NaN    NaN
D       female    0.6    0.2    0.4    NaN    NaN    NaN
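The steps above can also be collected into one self-contained script. This is a sketch under a few assumptions: `combine` is a hypothetical helper name, the input frames are rebuilt from the printed output in the question, and `first()` (first non-null value per group) is swapped in for the `max` aggregation, which should give the same result on this data because each animal has at most one non-null value per column.

```python
import pandas as pd


def combine(df1, df2):
    """Join df2 onto df1 by Sample, then spread Time across the columns.

    Hypothetical helper: assumes df1 has Sample, Animal, Time and Sex
    columns and that df2 is indexed by Sample.
    """
    df = df1.set_index("Sample").join(df2)
    # Append Animal, Sex and Time to the index, then move Time into the columns.
    wide = df.set_index(["Animal", "Sex", "Time"], append=True).unstack("Time")
    # Flatten the (column, Time) MultiIndex into names such as 'one_a'.
    wide.columns = ["{}_{}".format(time, col) for col, time in wide.columns]
    # Collapse the per-sample rows into one row per animal.
    return (
        wide.reset_index("Sex")
        .groupby(level="Animal")
        .first()
        .sort_index(axis=1)
    )


# Input frames rebuilt from the question (dtypes are an assumption).
df1 = pd.DataFrame({
    "Sample": [1, 2, 3, 4, 5],
    "Animal": ["A", "A", "B", "C", "D"],
    "Time": ["one", "two", "one", "one", "one"],
    "Sex": ["male", "male", "female", "male", "female"],
})
df2 = pd.DataFrame(
    {
        "a": [0.2, 0.5, 0.4, 0.4, 0.6],
        "b": [0.4, 0.7, 0.1, 0.2, 0.2],
        "c": [0.3, 0.2, 0.9, 0.3, 0.4],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Sample"),
)

result = combine(df1, df2)
print(result)
```

Using `first()` instead of `max` avoids taking a lexicographic maximum of the Sex strings, which only works here because Sex is constant within each animal.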

