Appending and Removing Repeated Values on a Column of a Pandas DataFrame

So I have a dataframe that I made via df4.append(df3, ignore_index=True); however, I am having trouble removing repeats in my column Gene_Symbol while still keeping the values in Case1, Case2 and Case3. I have already tried df4.drop_duplicates(["Gene_Symbol"]) and various other methods, all of which tend to delete the other rows and, with them, my data.

What I am getting is this:

         X       Case1       Case2       Case3       Gene_Symbol 
8026    8025    0.5326718   0.0000000   0.0000000   GAPDHS;TMEM147
32531   32530   0.0000000   0.5416982   0.0000000   GAPDHS;TMEM147
57051   57050   0.0000000   0.0000000   0.4821592   GAPDHS;TMEM147

What I would like to have is the dataframe below, where my actual values are kept:

     Case1       Case2       Case3       Gene_Symbol 
    0.5326718   0.5416982   0.4821592   GAPDHS;TMEM147

Thank you for your time!

You could try the following. If each Case column contains only one non-zero value per gene, this should work (assuming you don't have the X column, which looks like an index):

df.set_index('Gene_Symbol').stack()[lambda x: x != 0].unstack(level=1).reset_index()

#      Gene_Symbol     Case1       Case2       Case3
#0  GAPDHS;TMEM147  0.532672    0.541698    0.482159

Or, if the X column is present, drop it first:

df
#          X       Case1       Case2       Case3       Gene_Symbol
#8026   8025    0.532672    0.000000    0.000000    GAPDHS;TMEM147
#32531  32530   0.000000    0.541698    0.000000    GAPDHS;TMEM147
#57051  57050   0.000000    0.000000    0.482159    GAPDHS;TMEM147

df.drop(columns='X', inplace=True)

df.set_index('Gene_Symbol').stack()[lambda x: x != 0].unstack(level=1).reset_index()

#      Gene_Symbol     Case1       Case2       Case3
#0  GAPDHS;TMEM147  0.532672    0.541698    0.482159
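
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the frame from the question and applies the same stack / filter / unstack steps. The variable names (stacked, nonzero, result) are only illustrative. It relies on the same assumption as above: at most one non-zero value per gene in each Case column. If a gene had two non-zero values in the same column, unstack would fail on the duplicate index entries.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "X": [8025, 32530, 57050],
        "Case1": [0.5326718, 0.0, 0.0],
        "Case2": [0.0, 0.5416982, 0.0],
        "Case3": [0.0, 0.0, 0.4821592],
        "Gene_Symbol": ["GAPDHS;TMEM147"] * 3,
    },
    index=[8026, 32531, 57051],
)

# Long format: one entry per (Gene_Symbol, Case column) pair.
stacked = df.drop(columns="X").set_index("Gene_Symbol").stack()

# Keep only the real measurements, i.e. drop the zero placeholders.
nonzero = stacked[stacked != 0]

# Back to wide format: one row per gene, one column per case.
result = nonzero.unstack(level=1).reset_index()
print(result)
```

Newer pandas versions may print a FutureWarning about the stack implementation; the result for this data should be the same either way since there are no missing values.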

How about:

df = df.groupby('Gene_Symbol')[['Case1', 'Case2', 'Case3']].sum().reset_index()

    Gene_Symbol     Case1       Case2       Case3
0   GAPDHS;TMEM147  0.532672    0.541698    0.482159
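
A runnable sketch of the groupby route, in case it helps. The contents of df3 and df4 below are hypothetical stand-ins, since the question does not show them. It also uses pd.concat rather than DataFrame.append, because append was removed in pandas 2.0. Note that sum only returns the original values when every other entry for that gene and column is zero; if a gene could carry several non-zero values in one column, they would be added together.

```python
import pandas as pd

# Hypothetical stand-ins for the question's df4 and df3.
df4 = pd.DataFrame(
    {
        "Case1": [0.5326718, 0.0],
        "Case2": [0.0, 0.5416982],
        "Case3": [0.0, 0.0],
        "Gene_Symbol": ["GAPDHS;TMEM147", "GAPDHS;TMEM147"],
    }
)
df3 = pd.DataFrame(
    {
        "Case1": [0.0],
        "Case2": [0.0],
        "Case3": [0.4821592],
        "Gene_Symbol": ["GAPDHS;TMEM147"],
    }
)

# Equivalent of df4.append(df3, ignore_index=True) on current pandas.
combined = pd.concat([df4, df3], ignore_index=True)

# Collapse the rows of each gene; zeros contribute nothing to the sum.
case_cols = ["Case1", "Case2", "Case3"]
result = combined.groupby("Gene_Symbol")[case_cols].sum().reset_index()
print(result)
```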
