Iterate over two columns and count how many values in one column match with exact values in the second column?

I have a data frame that is the output of an application that overlaps mutations with genes. Sometimes big mutations overlap with more than one gene, so the data frame is structured like this:

mutation1        1gene_affected # mut1 only affected one gene
mutation2        1gene_affected # mut2 has affected 2 genes
mutation2        2gene_affected
mutation3        NO_gene_affected # this case also occurs; it can be filtered out beforehand

How can I count the following:

number of mutations that affect 1 gene,
number of mutations that affect 2 genes,
number of mutations that affect 3 genes,
number of mutations that affect 4 genes,
number of mutations that affect 5 genes,
number of mutations that affect > 5 but <10,
number of mutations that affect >10 but <20,
number of mutations that affect >30 genes,

I would like to save these values in variables and pass them to a function I have already written that saves statistics to a file.

Suppose the columns of your dataframe are ["mutation", "gene"]. Calling value_counts on the mutation column gives the number of occurrences of each mutation, which is the number of genes it affects. A comparison method such as eq or ge is then enough. For instance, to find all mutations affecting exactly X genes:

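counts = df.loc[:, "mutation"].value_counts()  # rows per mutation
mask_eq_X = counts.eq(X)
# the mask is indexed by mutation name, so apply it to counts, not to df
print(counts[mask_eq_X])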
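
As a self-contained illustration of the idea above, here is a minimal sketch on toy data. The column names `mutation` and `gene`, the sample rows, and `X = 2` are assumptions made for the example. It also assumes that each row is one mutation/gene overlap and that `NO_gene_affected` rows were filtered out beforehand.

```python
import pandas as pd

# Toy data (assumed): one row per mutation/gene overlap.
df = pd.DataFrame({
    "mutation": ["mutation1", "mutation2", "mutation2",
                 "mutation3", "mutation3", "mutation3"],
    "gene": ["geneA", "geneA", "geneB", "geneC", "geneD", "geneE"],
})

X = 2  # hypothetical target: mutations affecting exactly 2 genes

# Rows per mutation, i.e. the number of genes each mutation affects.
counts = df["mutation"].value_counts()

# Names of the mutations affecting exactly X genes.
mutations_eq_X = counts[counts.eq(X)].index.tolist()

# The same condition mapped back onto the rows of the original frame.
rows_eq_X = df[df["mutation"].map(counts).eq(X)]

print(mutations_eq_X)
print(rows_eq_X)
```

Mapping the counts back with `map` is what makes the boolean mask line up with the rows of `df`; the mask built directly from `value_counts` is indexed by mutation name instead.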

Edit

For more complex comparisons, just combine masks. For instance, the >5 and <10 condition is expressed as follows:

mask_greater_than_5 = df.loc[:, "mutation"].value_counts().gt(5)
mask_lesser_than_10 = df.loc[:, "mutation"].value_counts().lt(10)

complex_mask = mask_greater_than_5 & mask_lesser_than_10
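
To turn such masks into the numbers asked for in the question, one option is to sum each boolean mask, since every `True` counts as 1. The sketch below uses made-up data and the same assumed `mutation` column; the variable names are illustrative only, and `inclusive="neither"` assumes pandas 1.3 or later.

```python
import pandas as pd

# Toy data (assumed): "m1" overlaps 1 gene, "m2" 2 genes, "m7" 7 genes.
df = pd.DataFrame({"mutation": ["m1"] + ["m2"] * 2 + ["m7"] * 7})

genes_per_mutation = df["mutation"].value_counts()

# Summing a boolean mask counts the mutations that satisfy it.
n_affect_1 = int(genes_per_mutation.eq(1).sum())
n_affect_2 = int(genes_per_mutation.eq(2).sum())

# > 5 and < 10, written as two combined masks ...
n_affect_6_to_9 = int((genes_per_mutation.gt(5) & genes_per_mutation.lt(10)).sum())
# ... or with Series.between and both bounds excluded.
n_between = int(genes_per_mutation.between(5, 10, inclusive="neither").sum())

print(n_affect_1, n_affect_2, n_affect_6_to_9, n_between)
```

The resulting plain integers can then be passed to whatever function writes the statistics file.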

If this is your dataframe:

df
# Out: 
#         col1             col2
# 0  mutation1   1gene_affected
# 1  mutation2   1gene_affected
# 2  mutation2   2gene_affected
# 3  mutation3  NOgene_affected

You can group by the second column and count the rows in each group:

df.groupby('col2').count()
# Out: 
#                  col1
# col2                 
# 1gene_affected      2
# 2gene_affected      1
# NOgene_affected     1
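
Grouping by `col2` counts rows per label. If the goal is the number of genes per mutation, a possible variation is to group by `col1` instead and then count how often each group size occurs. This is only a sketch: it assumes every row other than `NOgene_affected` is one affected gene, and the variable names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["mutation1", "mutation2", "mutation2", "mutation3"],
    "col2": ["1gene_affected", "1gene_affected",
             "2gene_affected", "NOgene_affected"],
})

# Drop the rows that report no overlap at all.
hits = df[df["col2"] != "NOgene_affected"]

# Number of rows (gene overlaps) per mutation.
genes_per_mutation = hits.groupby("col1").size()

# How many mutations affect 1 gene, 2 genes, ...
distribution = genes_per_mutation.value_counts().sort_index()
print(distribution)
```

Here `distribution` is indexed by the number of affected genes, so `distribution.get(2, 0)` would give the number of mutations affecting exactly two genes.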

Clean your second column, then use pd.cut:

import numpy as np
import pandas as pd

count = df['mutation'].str.replace('NO_', '0') \
                      .str.extract(r'^(\d+)', expand=False).astype(int)

lbls = ['No gene', '1 gene', '2 genes', '3 genes', '4 genes', '5 genes',
        'between 5 and 10', 'between 10 and 20', 'between 20 and 30',
        'more than 30 genes']
bins = [-np.inf, 1, 2, 3, 4, 5, 6, 10, 20, 30, np.inf]

df['group'] = pd.cut(count, bins=bins, labels=lbls, right=False)

out = df.value_counts('group', sort=False)

Output:

>>> out
group
No gene               1
1 gene                2
2 genes               1
3 genes               0
4 genes               0
5 genes               0
between 5 and 10      0
between 10 and 20     0
between 20 and 30     0
more than 30 genes    0
dtype: int64

Setup:

>>> df
        name          mutation
0  mutation1    1gene_affected
1  mutation2    1gene_affected
2  mutation2    2gene_affected
3  mutation3  NO_gene_affected
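
Note that `mutation2` appears twice in this setup frame, so binning the number parsed from each label places it in two groups. If one count per mutation is wanted instead, a possible variation is to count the affected rows per `name` first and bin those counts. This is a sketch under the assumption that each row not starting with `NO_` is one affected gene; the bin edges and labels are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["mutation1", "mutation2", "mutation2", "mutation3"],
    "mutation": ["1gene_affected", "1gene_affected",
                 "2gene_affected", "NO_gene_affected"],
})

# True where the row reports an affected gene; summing the booleans
# per mutation name gives the number of genes that mutation affects.
affected = ~df["mutation"].str.startswith("NO_")
genes_per_mutation = affected.groupby(df["name"]).sum()

# Left-closed bins: [1, 2) holds exactly 1, [6, 10) holds 6 to 9, etc.
bins = [-np.inf, 1, 2, 3, 4, 5, 6, 10, 20, 30, np.inf]
labels = ["No gene", "1 gene", "2 genes", "3 genes", "4 genes", "5 genes",
          "6 to 9 genes", "10 to 19 genes", "20 to 29 genes",
          "30 or more genes"]

groups = pd.cut(genes_per_mutation, bins=bins, labels=labels, right=False)

# One entry per label, including labels that no mutation falls into.
out = groups.value_counts(sort=False)
print(out)
```

Individual values such as `int(out["2 genes"])` can then be stored in variables and handed to the function that writes the statistics file.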
