Group by a column and count string in another column in Python Pandas dataframe
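I have the following dataframe as the output of my Python script. I would like to add another column holding, for each PMID, the count of rows where gene_label is "Matched Gene".
The dataframe looks like this: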
df
PMID gene_symbol gene_label gene_mentions
0 33377242 MTHFR Matched Gene 2
1 33414971 CSF3R Matched Gene 13
2 33414971 BCR Other Gene 2
3 33414971 ABL1 Matched Gene 1
4 33414971 ESR1 Matched Gene 1
5 33414971 NDUFB3 Other Gene 1
6 33414971 CSF3 Other Gene 1
7 33414971 TP53 Matched Gene 2
8 33414971 SRC Matched Gene 1
9 33414971 JAK1 Matched Gene 1
The expected output is:
PMID gene_symbol gene_label gene_mentions matched_count
0 33377242 MTHFR Matched Gene 2 1
1 33414971 CSF3R Matched Gene 13 6
2 33414971 BCR Other Gene 2 6
3 33414971 ABL1 Matched Gene 1 6
4 33414971 ESR1 Matched Gene 1 6
5 33414971 NDUFB3 Other Gene 1 6
6 33414971 CSF3 Other Gene 1 6
7 33414971 TP53 Matched Gene 2 6
8 33414971 SRC Matched Gene 1 6
9 33414971 JAK1 Matched Gene 1 6
I used the following statement, but it does not take the other rows into account.
df.loc[df.gene_label == 'Matched Gene', 'PMID'].value_counts()
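For context, that statement returns one count per PMID (a Series indexed by PMID) rather than one value per row, which is why the "Other Gene" rows seem to be ignored. A minimal sketch of how such a Series could be looked up row by row, assuming a reduced frame with only the PMID and gene_label columns (the shortened data and the fillna(0) handling are assumptions, not part of the question):

```python
import pandas as pd

# Small illustrative frame with the same column names as in the question.
df = pd.DataFrame({
    "PMID": [33377242, 33414971, 33414971, 33414971],
    "gene_label": ["Matched Gene", "Matched Gene", "Other Gene", "Matched Gene"],
})

# One count per PMID: a Series indexed by PMID, not one value per row.
counts = df.loc[df["gene_label"] == "Matched Gene", "PMID"].value_counts()

# Look up every row's PMID in that Series; PMIDs without a match become 0.
df["matched_count"] = df["PMID"].map(counts).fillna(0).astype(int)
print(df)
```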
First, filter to keep only the rows labeled "Matched Gene", aggregate and count by pmid and gene_label, then merge the result back into the original dataframe.
# Setup
import pandas as pd

pmid = [33377242] + [33414971] * 9
gene_symbol = ["MTHFR", "CSF3R", "BCR", "ABL1", "ESR1", "NDUFB3", "CSF3", "TP53", "SRC", "JAK1"]
gene_label = ["Matched Gene", "Matched Gene", "Other Gene", "Matched Gene", "Matched Gene", "Other Gene", "Other Gene", "Matched Gene", "Matched Gene", "Matched Gene"]
gene_mentions = [2, 13, 2, 1, 1, 1, 1, 2, 1, 1]
df = pd.DataFrame({"pmid":pmid, "gene_symbol":gene_symbol, "gene_label":gene_label, "gene_mentions":gene_mentions})
# Keep matched gene only
filter_df = df[df["gene_label"] == "Matched Gene"]
# Aggregate and count
agg_df = filter_df.groupby(["pmid", "gene_label"], as_index=False).agg(matched_count=("gene_label", "count"))
# Add count back to original dataframe by merging
df = df.merge(agg_df[["pmid", "matched_count"]], on="pmid")
output:
pmid gene_symbol gene_label gene_mentions matched_count
0 33377242 MTHFR Matched Gene 2 1
1 33414971 CSF3R Matched Gene 13 6
2 33414971 BCR Other Gene 2 6
3 33414971 ABL1 Matched Gene 1 6
4 33414971 ESR1 Matched Gene 1 6
5 33414971 NDUFB3 Other Gene 1 6
6 33414971 CSF3 Other Gene 1 6
7 33414971 TP53 Matched Gene 2 6
8 33414971 SRC Matched Gene 1 6
9 33414971 JAK1 Matched Gene 1 6
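One thing to be aware of with the merge approach: merge defaults to an inner join, so a PMID with no "Matched Gene" rows would disappear from the result. The sample data has no such PMID, so the output above is unaffected. A hedged sketch of a variant that keeps those rows with a count of 0, using made-up pmid values 1, 2 and 3 purely for illustration:

```python
import pandas as pd

# Hypothetical data: pmid 2 has no "Matched Gene" row at all.
df = pd.DataFrame({
    "pmid": [1, 1, 2, 3],
    "gene_label": ["Matched Gene", "Other Gene", "Other Gene", "Matched Gene"],
})

# Count "Matched Gene" rows per pmid.
agg_df = (
    df[df["gene_label"] == "Matched Gene"]
    .groupby("pmid", as_index=False)
    .agg(matched_count=("gene_label", "count"))
)

# Default inner join: the pmid 2 row is dropped.
inner = df.merge(agg_df, on="pmid")

# Left join keeps every original row; missing counts are filled with 0.
left = df.merge(agg_df, on="pmid", how="left")
left["matched_count"] = left["matched_count"].fillna(0).astype(int)
print(left)
```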
from io import StringIO
import pandas as pd
# Recreate df from posted question.
text = """PMID gene_symbol gene_label gene_mentions
0 33377242 MTHFR Matched Gene 2
1 33414971 CSF3R Matched Gene 13
2 33414971 BCR Other Gene 2
3 33414971 ABL1 Matched Gene 1
4 33414971 ESR1 Matched Gene 1
5 33414971 NDUFB3 Other Gene 1
6 33414971 CSF3 Other Gene 1
7 33414971 TP53 Matched Gene 2
8 33414971 SRC Matched Gene 1
9 33414971 JAK1 Matched Gene 1""".replace(
" Gene", "_Gene"
)
csv_text = "\n".join(",".join(line.split()) for line in text.splitlines()).replace(
"_Gene",
" Gene",
)
df = pd.read_csv(StringIO(csv_text), delimiter=",")
# Group by gene_label and PMID, then select the per-PMID group sizes for "Matched Gene"
sizes = df.groupby(by=["gene_label", "PMID"]).size()["Matched Gene"]
print(sizes)
# Create a new column by looking up each row's PMID in the sizes Series.
df["matched_count"] = df.PMID.apply(lambda x: sizes.loc[x])
print(df.to_string())
OUTPUT:
PMID
33377242 1
33414971 6
dtype: int64
PMID gene_symbol gene_label gene_mentions matched_count
0 33377242 MTHFR Matched Gene 2 1
1 33414971 CSF3R Matched Gene 13 6
2 33414971 BCR Other Gene 2 6
3 33414971 ABL1 Matched Gene 1 6
4 33414971 ESR1 Matched Gene 1 6
5 33414971 NDUFB3 Other Gene 1 6
6 33414971 CSF3 Other Gene 1 6
7 33414971 TP53 Matched Gene 2 6
8 33414971 SRC Matched Gene 1 6
9 33414971 JAK1 Matched Gene 1 6
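You can do this with a groupby transform. Since you are looking for the specific value 'Matched Gene', you need to filter on it and then group. After that you can fill in the value for the remaining rows.
This will do it: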
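# Count 'Matched Gene' rows per PMID; rows with any other label are left as NaN.
df['matched_counts'] = df[df['gene_label']=='Matched Gene'].groupby(['PMID'])['gene_label'].transform('count')
# Forward-fill the NaN values. Note: this relies on each PMID's rows being
# contiguous and starting with a 'Matched Gene' row, as in the sample data.
df['matched_counts'] = df['matched_counts'].ffill().astype(int)
print(df)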
The output will be:
PMID gene_symbol gene_label gene_mentions matched_counts
0 33377242 MTHFR Matched Gene 2 1
1 33414971 CSF3R Matched Gene 13 6
2 33414971 BCR Other Gene 2 6
3 33414971 ABL1 Matched Gene 1 6
4 33414971 ESR1 Matched Gene 1 6
5 33414971 NDUFB3 Other Gene 1 6
6 33414971 CSF3 Other Gene 1 6
7 33414971 TP53 Matched Gene 2 6
8 33414971 SRC Matched Gene 1 6
9 33414971 JAK1 Matched Gene 1 6
Alternatively, you can also do this:
df['matched_counts'] = df.groupby('PMID')['gene_label'].transform(lambda x: sum(x == 'Matched Gene'))
This one-liner does the same trick and gives you the same result as shown above.
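The same idea can probably be written without a Python-level lambda by building a boolean mask first and summing it per PMID. The sketch below is only an illustration: the PMID values 10, 20 and 30 and the row order are made up to show that the result does not depend on where the 'Matched Gene' rows sit within a group, and that a PMID with no match gets 0.

```python
import pandas as pd

# Hypothetical data: PMID 10 starts with an 'Other Gene' row,
# and PMID 20 has no 'Matched Gene' row at all.
df = pd.DataFrame({
    "PMID": [10, 10, 10, 20, 30, 30],
    "gene_label": ["Other Gene", "Matched Gene", "Matched Gene",
                   "Other Gene", "Matched Gene", "Other Gene"],
})

# 1 where the label matches, 0 otherwise.
is_matched = df["gene_label"].eq("Matched Gene").astype(int)

# Sum the mask within each PMID and broadcast the total back to every row.
df["matched_counts"] = is_matched.groupby(df["PMID"]).transform("sum")
print(df)
```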