Iterate over two columns and count how many values in one column match exact values in the second column?
I have a data frame that is the output of an application that overlaps mutations with genes. Sometimes a large mutation overlaps more than one gene, so the data frame is structured like this:
mutation1 1gene_affected # mut1 only affected one gene
mutation2 1gene_affected # mut2 has affected 2 genes
mutation2 2gene_affected
mutation3 NO_gene_affected # there is also this. This can be filtered previously.
How can I count the
number of mutations that affect 1 gene,
number of mutations that affect 2 genes,
number of mutations that affect 3 genes,
number of mutations that affect 4 genes,
number of mutations that affect 5 genes,
number of mutations that affect > 5 but <10,
number of mutations that affect >10 but <20,
number of mutations that affect >30 genes,
I would like to save these values in variables and call a function I already created that saves statistics data in a file.
Let's suppose the columns of your dataframe are the following: ["mutation", "gene"]. Using value_counts on the mutation column will give you the number of occurrences of each mutation, which is the number of genes it affects. Then a comparison function such as eq, ge, gt or lt will suffice. For instance, to find all mutations affecting exactly X genes:
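# value_counts is indexed by mutation name, so the mask must be applied to the counts, not to df
counts = df.loc[:, "mutation"].value_counts()
mask_eq_X = counts.eq(X)
print(counts[mask_eq_X])  # mutations that occur exactly X times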
For more complex comparisons, just combine masks. For instance, the >5 and <10 condition is expressed as follows:
mask_greater_than_5 = df.loc[:, "mutation"].value_counts().gt(5)
mask_lesser_than_10 = df.loc[:, "mutation"].value_counts().lt(10)
complex_mask = mask_greater_than_5 & mask_lesser_than_10
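To turn such masks into the numbers the question asks for, one option is to sum each mask, since every True counts as 1. The following is a minimal sketch under a few assumptions: the toy data mirrors the question's example, the column names "mutation" and "gene" are the ones supposed above, and the variable names are purely illustrative.

```python
import pandas as pd

# Toy data modeled on the question: one row per (mutation, gene) overlap.
df = pd.DataFrame({
    "mutation": ["mutation1", "mutation2", "mutation2", "mutation3"],
    "gene": ["1gene_affected", "1gene_affected", "2gene_affected", "NO_gene_affected"],
})

# Drop mutations that hit no gene, then count the remaining rows per mutation.
hits = df[df["gene"] != "NO_gene_affected"]
genes_per_mutation = hits["mutation"].value_counts()

# Summing a boolean mask gives the number of mutations that satisfy it.
n_exactly_1 = int(genes_per_mutation.eq(1).sum())
n_exactly_2 = int(genes_per_mutation.eq(2).sum())
n_between_5_and_10 = int(
    (genes_per_mutation.gt(5) & genes_per_mutation.lt(10)).sum()
)

print(n_exactly_1, n_exactly_2, n_between_5_and_10)
```

The resulting plain integers can then be passed to whatever function writes the statistics file.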
If this is your dataframe:
df
# Out:
# col1 col2
# 0 mutation1 1gene_affected
# 1 mutation2 1gene_affected
# 2 mutation2 2gene_affected
# 3 mutation3 NOgene_affected
You can group by the second column and count the rows in each group:
df.groupby('col2').count()
# Out:
# col1
# col2
# 1gene_affected 2
# 2gene_affected 1
# NOgene_affected 1
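If the goal is instead the number of genes per mutation, the same idea can be applied to the first column. The sketch below assumes each row is one mutation/gene overlap and that rows whose label starts with "NO" mean that no gene was hit; the intermediate names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["mutation1", "mutation2", "mutation2", "mutation3"],
    "col2": ["1gene_affected", "1gene_affected", "2gene_affected", "NOgene_affected"],
})

# Keep only rows that represent a real mutation/gene overlap.
hits = df[~df["col2"].str.startswith("NO")]

# One row per overlap, so the size of each group is the number of genes affected.
genes_per_mutation = hits.groupby("col1").size()

# Index: number of genes affected; value: how many mutations affect that many genes.
mutations_per_gene_count = genes_per_mutation.value_counts().sort_index()
print(mutations_per_gene_count)
```

Calling value_counts on the group sizes gives the distribution directly, so a separate comparison is not needed for each exact gene count.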
Clean your second column, then use pd.cut:
import numpy as np
import pandas as pd

# 'NO_gene_affected' -> '0gene_affected', then keep the leading integer
count = df['mutation'].str.replace('NO_', '0', regex=False) \
                      .str.extract(r'^(\d+)', expand=False).astype(int)
lbls = ['No gene', '1 gene', '2 genes', '3 genes', '4 genes', '5 genes',
        'between 5 and 10', 'between 10 and 20', 'between 20 and 30',
        'more than 30 genes']
# Bins are left-closed (right=False): [1, 2) is '1 gene', [6, 10) is 'between 5 and 10', ...
bins = [-np.inf, 1, 2, 3, 4, 5, 6, 10, 20, 30, np.inf]
df['group'] = pd.cut(count, bins=bins, labels=lbls, right=False)
out = df.value_counts('group', sort=False)
Output:
>>> out
group
No gene               1
1 gene                2
2 genes               1
3 genes               0
4 genes               0
5 genes               0
between 5 and 10      0
between 10 and 20     0
between 20 and 30     0
more than 30 genes    0
dtype: int64
Setup:
>>> df
name mutation
0 mutation1 1gene_affected
1 mutation2 1gene_affected
2 mutation2 2gene_affected
3 mutation3 NO_gene_affected
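Binning the rows directly counts every row, so a mutation that affects two genes appears under both "1 gene" and "2 genes". If one count per mutation is wanted, a possible variation is to reduce to one value per mutation before calling pd.cut. This sketch assumes the labels are numbered 1..k within each mutation, so the largest number is the gene count; the shortened bin list and its labels are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["mutation1", "mutation2", "mutation2", "mutation3"],
    "mutation": ["1gene_affected", "1gene_affected",
                 "2gene_affected", "NO_gene_affected"],
})

# Leading integer of each label; 'NO_...' becomes 0.
idx = (df["mutation"].str.replace("NO_", "0", regex=False)
                     .str.extract(r"^(\d+)", expand=False)
                     .astype(int))

# Assumption: labels run 1..k per mutation, so the max is the number of genes.
genes_per_mutation = idx.groupby(df["name"]).max()

# Left-closed bins: [1, 2) is exactly 1 gene, [3, inf) is 3 or more.
bins = [-np.inf, 1, 2, 3, np.inf]
lbls = ["No gene", "1 gene", "2 genes", "3 or more genes"]
groups = pd.cut(genes_per_mutation, bins=bins, labels=lbls, right=False)

out = groups.value_counts(sort=False)
print(out)
```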