Group by a column and count a string in another column in a Python pandas dataframe
I have the following dataframe as the output of my Python script. I would like to add another column with the count, per PMID, of rows where gene_label is "Matched Gene".
The dataframe looks like this:
df
       PMID  gene_symbol    gene_label  gene_mentions
0  33377242        MTHFR  Matched Gene              2
1  33414971        CSF3R  Matched Gene             13
2  33414971          BCR    Other Gene              2
3  33414971         ABL1  Matched Gene              1
4  33414971         ESR1  Matched Gene              1
5  33414971       NDUFB3    Other Gene              1
6  33414971         CSF3    Other Gene              1
7  33414971         TP53  Matched Gene              2
8  33414971          SRC  Matched Gene              1
9  33414971         JAK1  Matched Gene              1
The expected output is:
       PMID  gene_symbol    gene_label  gene_mentions  matched_count
0  33377242        MTHFR  Matched Gene              2              1
1  33414971        CSF3R  Matched Gene             13              6
2  33414971          BCR    Other Gene              2              6
3  33414971         ABL1  Matched Gene              1              6
4  33414971         ESR1  Matched Gene              1              6
5  33414971       NDUFB3    Other Gene              1              6
6  33414971         CSF3    Other Gene              1              6
7  33414971         TP53  Matched Gene              2              6
8  33414971          SRC  Matched Gene              1              6
9  33414971         JAK1  Matched Gene              1              6
I have used the following statement, but it only returns the counts per PMID and does not attach them to the other rows.
df.loc[df.gene_label == 'Matched Gene', 'PMID'].value_counts()
First filter to keep only the "Matched Gene" labels, aggregate and count by pmid and gene_label, and then join the result back to the original dataframe.
# Setup
import pandas as pd
pmid = [33377242] + [33414971 for i in range(9)]
gene_symbol = ["MTHFR", "CSF3R", "BCR", "ABL1", "ESR1", "NDUFB3", "CSF3", "TP53", "SRC", "JAK1"]
gene_label = ["Matched Gene", "Matched Gene", "Other Gene", "Matched Gene", "Matched Gene", "Other Gene", "Other Gene", "Matched Gene", "Matched Gene", "Matched Gene"]
gene_mentions = [2, 13, 2, 1, 1, 1, 1, 2, 1, 1]
df = pd.DataFrame({"pmid":pmid, "gene_symbol":gene_symbol, "gene_label":gene_label, "gene_mentions":gene_mentions})
# Keep matched gene only
filter_df = df[df["gene_label"] == "Matched Gene"]
# Aggregate and count
agg_df = filter_df.groupby(["pmid", "gene_label"], as_index=False).agg(matched_count=("gene_label", "count"))
# Add count back to original dataframe by merging
df = df.merge(agg_df[["pmid", "matched_count"]], on="pmid")
Output:
       pmid  gene_symbol    gene_label  gene_mentions  matched_count
0  33377242        MTHFR  Matched Gene              2              1
1  33414971        CSF3R  Matched Gene             13              6
2  33414971          BCR    Other Gene              2              6
3  33414971         ABL1  Matched Gene              1              6
4  33414971         ESR1  Matched Gene              1              6
5  33414971       NDUFB3    Other Gene              1              6
6  33414971         CSF3    Other Gene              1              6
7  33414971         TP53  Matched Gene              2              6
8  33414971          SRC  Matched Gene              1              6
9  33414971         JAK1  Matched Gene              1              6
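One thing to watch with the merge approach: merge defaults to an inner join, so a pmid with no "Matched Gene" rows at all would disappear from the result. Below is a minimal sketch of a left-join variant that keeps those rows and gives them a count of 0. The small sample frame (pmids 111 and 222) is made up for illustration and is not from the question.

```python
import pandas as pd

# Hypothetical sample: pmid 111 has no "Matched Gene" rows at all.
df = pd.DataFrame({
    "pmid": [111, 222, 222, 222],
    "gene_label": ["Other Gene", "Matched Gene", "Other Gene", "Matched Gene"],
})

# Count the "Matched Gene" rows per pmid.
agg_df = (
    df[df["gene_label"] == "Matched Gene"]
    .groupby("pmid", as_index=False)
    .agg(matched_count=("gene_label", "count"))
)

# A left merge keeps every original row; pmids without matches get NaN,
# which is then replaced by 0 so the column can be cast back to int.
result = df.merge(agg_df, on="pmid", how="left")
result["matched_count"] = result["matched_count"].fillna(0).astype(int)
print(result)
```

With the data from the question every pmid has at least one match, so both join types should give the same result there.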
from io import StringIO
import pandas as pd
# Recreate df from posted question.
text = """PMID gene_symbol gene_label gene_mentions
0 33377242 MTHFR Matched Gene 2
1 33414971 CSF3R Matched Gene 13
2 33414971 BCR Other Gene 2
3 33414971 ABL1 Matched Gene 1
4 33414971 ESR1 Matched Gene 1
5 33414971 NDUFB3 Other Gene 1
6 33414971 CSF3 Other Gene 1
7 33414971 TP53 Matched Gene 2
8 33414971 SRC Matched Gene 1
9 33414971 JAK1 Matched Gene 1""".replace(
" Gene", "_Gene"
)
csv_text = "\n".join(",".join(line.split()) for line in text.splitlines()).replace(
"_Gene",
" Gene",
)
df = pd.read_csv(StringIO(csv_text), delimiter=",")
# Group by gene_label and PMID, then select the group sizes for the "Matched Gene" label
sizes = df.groupby(by=["gene_label", "PMID"]).size()["Matched Gene"]
print(sizes)
# Create a new column by looking up each row's PMID in the sizes Series.
df["matched_count"] = df.PMID.apply(lambda x: sizes.loc[x])
print(df.to_string())
OUTPUT:
PMID
33377242    1
33414971    6
dtype: int64
       PMID  gene_symbol    gene_label  gene_mentions  matched_count
0  33377242        MTHFR  Matched Gene              2              1
1  33414971        CSF3R  Matched Gene             13              6
2  33414971          BCR    Other Gene              2              6
3  33414971         ABL1  Matched Gene              1              6
4  33414971         ESR1  Matched Gene              1              6
5  33414971       NDUFB3    Other Gene              1              6
6  33414971         CSF3    Other Gene              1              6
7  33414971         TP53  Matched Gene              2              6
8  33414971          SRC  Matched Gene              1              6
9  33414971         JAK1  Matched Gene              1              6
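The same lookup idea can also be built directly on the value_counts() statement from the question, using Series.map instead of apply with .loc. This is only a sketch on made-up sample data (PMIDs 111 and 222 are hypothetical); the advantage I would expect is that a PMID with no "Matched Gene" rows maps to NaN, which can be filled with 0, rather than raising a KeyError.

```python
import pandas as pd

# Hypothetical sample: PMID 111 has no "Matched Gene" rows.
df = pd.DataFrame({
    "PMID": [111, 222, 222, 222],
    "gene_label": ["Other Gene", "Matched Gene", "Other Gene", "Matched Gene"],
})

# The statement from the question: a Series of counts indexed by PMID.
counts = df.loc[df["gene_label"] == "Matched Gene", "PMID"].value_counts()

# Map every row's PMID to its count; PMIDs missing from counts become NaN -> 0.
df["matched_count"] = df["PMID"].map(counts).fillna(0).astype(int)
print(df)
```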
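You can use groupby with transform to get this done. Since you are looking for the specific value 'Matched Gene', you need to filter for that first and then do the groupby. The rows that were filtered out end up as NaN, so you can then ffill the value. Note that the forward fill takes the value from the nearest preceding 'Matched Gene' row, so it relies on each PMID's rows being adjacent and on the first row of each PMID being a 'Matched Gene' row, as is the case in your data.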
This will do:
df['matched_counts'] = df[df['gene_label']=='Matched Gene'].groupby(['PMID'])['gene_label'].transform('count')
df['matched_counts'] = df['matched_counts'].ffill().astype(int)
print(df)
The output will be:
       PMID  gene_symbol    gene_label  gene_mentions  matched_counts
0  33377242        MTHFR  Matched Gene              2               1
1  33414971        CSF3R  Matched Gene             13               6
2  33414971          BCR    Other Gene              2               6
3  33414971         ABL1  Matched Gene              1               6
4  33414971         ESR1  Matched Gene              1               6
5  33414971       NDUFB3    Other Gene              1               6
6  33414971         CSF3    Other Gene              1               6
7  33414971         TP53  Matched Gene              2               6
8  33414971          SRC  Matched Gene              1               6
9  33414971         JAK1  Matched Gene              1               6
Alternatively, you can also do this:
df['matched_counts'] = df.groupby('PMID')['gene_label'].transform(lambda x: sum(x == 'Matched Gene'))
This single line does the same job and gives the results shown above. It also does not depend on the row order, because every row in a PMID group receives that group's count directly.
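If the lambda turns out to be slow on a large frame, a possible variant is to build the boolean mask once and sum it per group, which avoids calling Python code for each group. This is a sketch on a made-up sample (PMIDs 111, 222 and 333 are hypothetical), chosen so that one PMID starts with an "Other Gene" row and another has no matches at all.

```python
import pandas as pd

# Hypothetical sample: PMID 222 starts with an "Other Gene" row,
# and PMID 333 has no "Matched Gene" rows.
df = pd.DataFrame({
    "PMID": [111, 111, 222, 222, 333],
    "gene_label": ["Matched Gene", "Matched Gene", "Other Gene",
                   "Matched Gene", "Other Gene"],
})

# True where the label matches; summing booleans counts the True values.
is_matched = df["gene_label"].eq("Matched Gene")

# Group the mask by PMID and broadcast each group's sum back to its rows.
df["matched_counts"] = is_matched.groupby(df["PMID"]).transform("sum").astype(int)
print(df)
```

Because the count is computed per group and broadcast to every row of that group, no fill step is needed and PMIDs without any match simply get 0.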