Pandas - Merge rows and add columns with 'get_dummies'

With the following dataframe:

import pandas as pd
df = pd.DataFrame(
    data=[[1, 5179530, 'rs10799170', 8.1548, 'E001'],
          [1, 5179530, 'rs10799170', 8.1548, 'E002'],
          [1, 5179530, 'rs10799170', 8.1548, 'E003'],
          [1, 455521, 'rs235884', 2.584, 'E003'],
          [1, 455521, 'rs235884', 2.584, 'E007']],
    columns=['CHR', 'BP', 'SNP', 'CM', 'ANNOT'])

   CHR       BP         SNP      CM ANNOT
0    1  5179530  rs10799170  8.1548  E001
1    1  5179530  rs10799170  8.1548  E002
2    1  5179530  rs10799170  8.1548  E003
3    1   455521    rs235884  2.5840  E003
4    1   455521    rs235884  2.5840  E007

I would like to obtain:

   CHR       BP         SNP      CM  E001  E002  E003  E007
0    1  5179530  rs10799170  8.1548     1     1     1     0  
1    1   455521    rs235884  2.5840     0     0     1     1

I tried groupby() and get_dummies() separately:

df.groupby(['CHR','BP','SNP','CM']).sum()

                                      ANNOT
CHR BP      SNP        CM
1   455521  rs235884   2.5840      E003E007
    5179530 rs10799170 8.1548  E001E002E003

pd.get_dummies(df['ANNOT'])

    E001  E002  E003  E007
0     1     0     0     0
1     0     1     0     0
2     0     0     1     0
3     0     0     1     0
4     0     0     0     1

But I don't know how to combine the two, or whether there is another way.

As @Dadep points out in their comment, this can be achieved with a pivot table. If you want to stick to your get_dummies + groupby technique, though, you can do something like:

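# axis is keyword-only in recent pandas; ANNOT is dropped so that sum() does not concatenate its strings
pd.concat([df.drop(columns='ANNOT'), pd.get_dummies(df['ANNOT'])], axis=1).groupby(['CHR','BP','SNP','CM']).sum().reset_index()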

This first concatenates your dataframe with the output of the get_dummies call, then groups the result by the relevant columns, sums the dummy columns within each group, and finally resets the index so you don't have to deal with a multi-index dataframe. The result looks like:

   CHR       BP         SNP      CM  E001  E002  E003  E007
0    1   455521    rs235884  2.5840     0     0     1     1
1    1  5179530  rs10799170  8.1548     1     1     1     0
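
The pivot-table route mentioned at the top of this answer is not spelled out, so here is a minimal sketch of one way it could look. The helper column name `flag`, the `keys` list and the choice of `aggfunc='max'` are assumptions made for illustration, not something taken from the comment.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 5179530, 'rs10799170', 8.1548, 'E001'],
          [1, 5179530, 'rs10799170', 8.1548, 'E002'],
          [1, 5179530, 'rs10799170', 8.1548, 'E003'],
          [1, 455521, 'rs235884', 2.584, 'E003'],
          [1, 455521, 'rs235884', 2.584, 'E007']],
    columns=['CHR', 'BP', 'SNP', 'CM', 'ANNOT'])

keys = ['CHR', 'BP', 'SNP', 'CM']

wide = (
    df.assign(flag=1)                      # hypothetical helper column: 1 = annotation present
      .pivot_table(index=keys, columns='ANNOT', values='flag',
                   aggfunc='max', fill_value=0)
      .astype(int)                         # make the indicator columns integers explicitly
      .reset_index()                       # turn the grouping keys back into columns
      .rename_axis(columns=None)           # drop the leftover 'ANNOT' label on the column axis
)
print(wide)
```

As with the groupby version, the rows come back sorted by the key columns, so the rs235884 row is listed first.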

You are very close! Just combine the two techniques:

dummies = pd.get_dummies(df['ANNOT'])
combine = pd.concat([df, dummies], axis=1)
out = combine.groupby(['BP','CHR','SNP','CM']).sum().reset_index()

Or, depending on your application, you might want to use .max() instead of .sum(): max yields a 0/1 indicator even when an annotation repeats within a group, whereas sum counts the occurrences. Note that I changed the order of the columns in the groupby so that the result is not nested under a single CHR group. You can get the columns back in the order you want with:

out = out[['CHR', 'BP', 'SNP', 'CM'] + list(dummies)]
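
The choice between sum and max only matters when the same annotation appears more than once for a SNP. A small sketch of that case, assuming a made-up duplicate row that is not in the original data, and using `dtype=int` so the dummy columns are integers regardless of the pandas version:

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 455521, 'rs235884', 2.584, 'E003'],
          [1, 455521, 'rs235884', 2.584, 'E003'],   # hypothetical duplicate annotation
          [1, 455521, 'rs235884', 2.584, 'E007']],
    columns=['CHR', 'BP', 'SNP', 'CM', 'ANNOT'])

keys = ['CHR', 'BP', 'SNP', 'CM']
dummies = pd.get_dummies(df['ANNOT'], dtype=int)
combine = pd.concat([df[keys], dummies], axis=1)    # keys plus dummy columns only

counts = combine.groupby(keys).sum().reset_index()  # how many times each annotation occurs
flags = combine.groupby(keys).max().reset_index()   # whether each annotation occurs at all

print(counts)
print(flags)
```

Here `counts` reports 2 for E003 while `flags` reports 1, so max is the safer choice if the columns are meant to be pure indicators.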

Here's one way, using groupby and apply:

In [66]: (df.groupby(['CHR', 'BP', 'SNP', 'CM'])
            .apply(lambda x: {y:1 for y in x['ANNOT']})
            .apply(pd.Series)
            .fillna(0)
            .reset_index())
Out[66]:
   CHR       BP         SNP      CM  E001  E002  E003  E007
0    1   455521    rs235884  2.5840   0.0   0.0   1.0   1.0
1    1  5179530  rs10799170  8.1548   1.0   1.0   1.0   0.0
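
The columns above come out as floats because fillna(0) fills NaN entries. If integer columns are preferred, one alternative is to count rows per key and annotation and then unstack. Note that this swaps in a different technique (size + unstack) rather than the apply chain shown above; the `keys` name and the `clip` step are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 5179530, 'rs10799170', 8.1548, 'E001'],
          [1, 5179530, 'rs10799170', 8.1548, 'E002'],
          [1, 5179530, 'rs10799170', 8.1548, 'E003'],
          [1, 455521, 'rs235884', 2.584, 'E003'],
          [1, 455521, 'rs235884', 2.584, 'E007']],
    columns=['CHR', 'BP', 'SNP', 'CM', 'ANNOT'])

keys = ['CHR', 'BP', 'SNP', 'CM']

out = (
    df.groupby(keys + ['ANNOT'])
      .size()                        # number of rows per (keys, ANNOT) combination
      .unstack(fill_value=0)         # ANNOT values become columns; missing combinations get 0
      .clip(upper=1)                 # cap counts at 1 so each column is a 0/1 indicator
      .reset_index()
      .rename_axis(columns=None)     # drop the leftover 'ANNOT' label on the column axis
)
print(out)
```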
