
How to count the occurrences of elements in a list for each row in pandas

I have a DataFrame that looks like this. It is a multi-index DataFrame resulting from a group-by:

grouped = df.groupby(['chromosome', 'start_pos', 'end_pos',
                      'observed']).agg(lambda x: x.tolist())
                                          reference         zygosity    
chromosome  start_pos   end_pos observed                                            
chr1            69428   69428       G       [T, T]          [hom, hom]      
                69511   69511       G       [A, A]          [hom, hom]      
                762273  762273      A       [G, G, G]       [hom, het, hom] 
                762589  762589      C       [G]             [hom]       
                762592  762592      G       [C]             [het]       
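For anyone who wants to reproduce the frame above, here is a minimal sketch. The flat input `df` is an assumption: the question does not show the un-grouped rows, so they are reconstructed here from the grouped output, and the name `keys` is introduced only for illustration.

```python
import pandas as pd

# Hypothetical flat input, one row per observation, reconstructed from the
# grouped output shown above (the original un-grouped rows are not given).
positions = [69428, 69428, 69511, 69511, 762273, 762273, 762273, 762589, 762592]
df = pd.DataFrame({
    "chromosome": ["chr1"] * 9,
    "start_pos": positions,
    "end_pos": positions,
    "observed": ["G", "G", "G", "G", "A", "A", "A", "C", "G"],
    "reference": ["T", "T", "A", "A", "G", "G", "G", "G", "C"],
    "zygosity": ["hom", "hom", "hom", "hom", "hom", "het", "hom", "hom", "het"],
})

# Same group-by as in the question: every non-key column becomes a list per group.
keys = ["chromosome", "start_pos", "end_pos", "observed"]
grouped = df.groupby(keys).agg(lambda x: x.tolist())
print(grouped)
```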

For each row, I want to count the number of het and hom values in zygosity and store them in two new columns, 'count_hom' and 'count_het'.

I have tried a for loop, but it is slow and not very reliable when the data changes. Is there a way to do this with something like df.zygosity.len().sum(), but counting only het or only hom?

Use Series.apply with list.count:

grouped['count_hom'] = grouped['zygosity'].apply(lambda x: x.count('hom'))
grouped['count_het'] = grouped['zygosity'].apply(lambda x: x.count('het'))
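As a self-contained sketch of the two lines above, the following builds a small frame by hand that mimics the grouped result from the question (the construction via `pd.MultiIndex.from_tuples` is an assumption made only so the example runs on its own) and then applies `list.count` to each row.

```python
import pandas as pd

# Hand-built stand-in for the grouped frame in the question (assumed data).
index = pd.MultiIndex.from_tuples(
    [
        ("chr1", 69428, 69428, "G"),
        ("chr1", 69511, 69511, "G"),
        ("chr1", 762273, 762273, "A"),
        ("chr1", 762589, 762589, "C"),
        ("chr1", 762592, 762592, "G"),
    ],
    names=["chromosome", "start_pos", "end_pos", "observed"],
)
grouped = pd.DataFrame(
    {
        "reference": [["T", "T"], ["A", "A"], ["G", "G", "G"], ["G"], ["C"]],
        "zygosity": [["hom", "hom"], ["hom", "hom"], ["hom", "het", "hom"], ["hom"], ["het"]],
    },
    index=index,
)

# Each cell is a plain Python list, so list.count does the per-row counting.
grouped["count_hom"] = grouped["zygosity"].apply(lambda x: x.count("hom"))
grouped["count_het"] = grouped["zygosity"].apply(lambda x: x.count("het"))
print(grouped[["zygosity", "count_hom", "count_het"]])
```

This still calls a Python function once per row, so it is not vectorized, but it avoids writing the loop and the index bookkeeping by hand.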

Instead of working on the groupby result, you could adjust the groupby itself: pass lambdas to agg that count the "het" and "hom" values for each group at the time you build grouped (here df is the original, un-grouped DataFrame):

grouped = (df.groupby(['chromosome', 'start_pos', 'end_pos','observed'])
           .agg(reference=('reference', list), 
                zygosity=('zygosity', list), 
                count_het=('zygosity', lambda x: x.eq('het').sum()),
                count_hom=('zygosity', lambda x: x.eq('hom').sum())))
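A runnable sketch of this named-aggregation variant follows. The flat input `df` is an assumption reconstructed from the grouped output in the question, since the original rows are not shown.

```python
import pandas as pd

# Hypothetical flat input, reconstructed from the question's grouped output.
positions = [69428, 69428, 69511, 69511, 762273, 762273, 762273, 762589, 762592]
df = pd.DataFrame({
    "chromosome": ["chr1"] * 9,
    "start_pos": positions,
    "end_pos": positions,
    "observed": ["G", "G", "G", "G", "A", "A", "A", "C", "G"],
    "reference": ["T", "T", "A", "A", "G", "G", "G", "G", "C"],
    "zygosity": ["hom", "hom", "hom", "hom", "hom", "het", "hom", "hom", "het"],
})

# Named aggregation: output_column=(input_column, function).
# x.eq('het') is a boolean Series per group, and summing it counts the matches.
grouped = (
    df.groupby(["chromosome", "start_pos", "end_pos", "observed"])
    .agg(
        reference=("reference", list),
        zygosity=("zygosity", list),
        count_het=("zygosity", lambda x: x.eq("het").sum()),
        count_hom=("zygosity", lambda x: x.eq("hom").sum()),
    )
)
print(grouped)
```

Because the counts are computed on the string column before it is collapsed into lists, the comparison itself is vectorized within each group.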

If you want to collect every non-key column into a list without naming each one, you could build the aggregations from the column names:

cols = ['chromosome', 'start_pos', 'end_pos','observed']
# drop the grouping keys so that only the remaining columns are turned into lists
out = df.groupby(cols).agg(**{c: (c, list) for c in df.columns.drop(cols)}, 
                           count_het=('zygosity', lambda x: x.eq('het').sum()),
                           count_hom=('zygosity', lambda x: x.eq('hom').sum()))

You can count every distinct value dynamically, without naming the values in advance, using explode + groupby, then value_counts, then unstack (here df is the grouped DataFrame from the question, with its four-level index):

counts = (df['zygosity'].explode()
          .groupby(level=[0, 1, 2, 3]).value_counts()
          .unstack(level=4).fillna(0)
          .add_prefix('count_').astype(int))
new_df = pd.concat([df, counts], axis=1)

Output:

>>> new_df
                                       reference         zygosity  count_het  count_hom
chromosome start_pos end_pos observed                                                  
chr1       69428     69428   G            [T, T]       [hom, hom]          0          2
           69511     69511   G            [A, A]       [hom, hom]          0          2
           762273    762273  A         [G, G, G]  [hom, het, hom]          1          2
           762589    762589  C               [G]            [hom]          0          1
           762592    762592  G               [C]            [het]          1          0
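The explode-based approach can be sketched end to end as follows. The hand-built `grouped` frame is an assumption standing in for the question's data, and the intermediate name `counts` is introduced only for readability.

```python
import pandas as pd

# Hand-built stand-in for the grouped frame in the question (assumed data).
index = pd.MultiIndex.from_tuples(
    [
        ("chr1", 69428, 69428, "G"),
        ("chr1", 69511, 69511, "G"),
        ("chr1", 762273, 762273, "A"),
        ("chr1", 762589, 762589, "C"),
        ("chr1", 762592, 762592, "G"),
    ],
    names=["chromosome", "start_pos", "end_pos", "observed"],
)
grouped = pd.DataFrame(
    {
        "reference": [["T", "T"], ["A", "A"], ["G", "G", "G"], ["G"], ["C"]],
        "zygosity": [["hom", "hom"], ["hom", "hom"], ["hom", "het", "hom"], ["hom"], ["het"]],
    },
    index=index,
)

counts = (
    grouped["zygosity"]
    .explode()                        # one row per list element, index repeated
    .groupby(level=[0, 1, 2, 3])      # regroup by the four original index levels
    .value_counts()                   # count each distinct value per group
    .unstack(level=-1)                # distinct values become columns
    .fillna(0)                        # a value absent from a group counts as 0
    .add_prefix("count_")
    .astype(int)
)
new_df = pd.concat([grouped, counts], axis=1)
print(new_df)
```

Unlike the other approaches, nothing here hard-codes 'het' or 'hom': a third zygosity label in the data would simply produce a third count_ column.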


 