Group by a column and count string in another column in Python Pandas dataframe

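I have the following dataframe as the output of my Python script. I would like to add another column holding, for each PMID, the number of rows whose gene_label is "Matched Gene".

The dataframe looks like this: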

df

       PMID gene_symbol    gene_label gene_mentions
0  33377242       MTHFR  Matched Gene             2
1  33414971       CSF3R  Matched Gene            13
2  33414971         BCR    Other Gene             2
3  33414971        ABL1  Matched Gene             1
4  33414971        ESR1  Matched Gene             1
5  33414971      NDUFB3    Other Gene             1
6  33414971        CSF3    Other Gene             1
7  33414971        TP53  Matched Gene             2
8  33414971         SRC  Matched Gene             1
9  33414971        JAK1  Matched Gene             1

The expected output is:

       PMID gene_symbol    gene_label  gene_mentions  matched_count
0  33377242       MTHFR  Matched Gene              2              1
1  33414971       CSF3R  Matched Gene             13              6
2  33414971         BCR    Other Gene              2              6
3  33414971        ABL1  Matched Gene              1              6
4  33414971        ESR1  Matched Gene              1              6
5  33414971      NDUFB3    Other Gene              1              6
6  33414971        CSF3    Other Gene              1              6
7  33414971        TP53  Matched Gene              2              6
8  33414971         SRC  Matched Gene              1              6
9  33414971        JAK1  Matched Gene              1              6

I used the following statement, but it only returns one count per PMID and does not assign it to the other rows.

df.loc[df.gene_label == 'Matched Gene', 'PMID'].value_counts()

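First filter to keep only the "Matched Gene" label, then aggregate and count by pmid and gene_label, and finally merge the counts back into the original dataframe.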

# Setup
import pandas as pd

pmid = [33377242] + [33414971] * 9
gene_symbol = ["MTHFR", "CSF3R", "BCR", "ABL1", "ESR1", "NDUFB3", "CSF3", "TP53", "SRC", "JAK1"]
gene_label = ["Matched Gene", "Matched Gene", "Other Gene", "Matched Gene", "Matched Gene", "Other Gene", "Other Gene", "Matched Gene", "Matched Gene", "Matched Gene"]
gene_mentions = [2, 13, 2, 1, 1, 1, 1, 2, 1, 1]

df = pd.DataFrame({"pmid":pmid, "gene_symbol":gene_symbol, "gene_label":gene_label, "gene_mentions":gene_mentions})

# Keep matched gene only
filter_df = df[df["gene_label"] == "Matched Gene"]
# Aggregate and count
agg_df = filter_df.groupby(["pmid", "gene_label"], as_index=False).agg(matched_count=("gene_label", "count"))
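# Add count back to original dataframe; a left merge keeps pmids that have no matched gene
df = df.merge(agg_df[["pmid", "matched_count"]], on="pmid", how="left")
df["matched_count"] = df["matched_count"].fillna(0).astype(int)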

output:

       pmid gene_symbol    gene_label  gene_mentions  matched_count
0  33377242       MTHFR  Matched Gene              2              1
1  33414971       CSF3R  Matched Gene             13              6
2  33414971         BCR    Other Gene              2              6
3  33414971        ABL1  Matched Gene              1              6
4  33414971        ESR1  Matched Gene              1              6
5  33414971      NDUFB3    Other Gene              1              6
6  33414971        CSF3    Other Gene              1              6
7  33414971        TP53  Matched Gene              2              6
8  33414971         SRC  Matched Gene              1              6
9  33414971        JAK1  Matched Gene              1              6
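
The choice of merge type matters once some pmid has no "Matched Gene" row at all. The sketch below uses small made-up data (the pmid values 111 and 222 are hypothetical, not from the question) to compare the default inner merge with a left merge followed by filling the gaps with 0:

```python
import pandas as pd

# Hypothetical data: pmid 222 has no "Matched Gene" rows at all.
df = pd.DataFrame({
    "pmid": [111, 111, 222],
    "gene_label": ["Matched Gene", "Other Gene", "Other Gene"],
})

# Count matched rows per pmid.
matched = df[df["gene_label"] == "Matched Gene"]
agg_df = matched.groupby("pmid", as_index=False).agg(
    matched_count=("gene_label", "count")
)

# Default merge is an inner join: rows for pmid 222 disappear.
inner = df.merge(agg_df, on="pmid")

# A left join keeps every original row; missing counts become 0.
left = df.merge(agg_df, on="pmid", how="left")
left["matched_count"] = left["matched_count"].fillna(0).astype(int)

print(len(inner), len(left))
print(left)
```

With the sample data from the question every pmid has at least one match, so both merge types give the same result there.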

from io import StringIO

import pandas as pd

# Recreate df from posted question.
text = """PMID gene_symbol    gene_label gene_mentions
0  33377242       MTHFR  Matched Gene             2
1  33414971       CSF3R  Matched Gene            13
2  33414971         BCR    Other Gene             2
3  33414971        ABL1  Matched Gene             1
4  33414971        ESR1  Matched Gene             1
5  33414971      NDUFB3    Other Gene             1
6  33414971        CSF3    Other Gene             1
7  33414971        TP53  Matched Gene             2
8  33414971         SRC  Matched Gene             1
9  33414971        JAK1  Matched Gene             1""".replace(
    " Gene", "_Gene"
)
csv_text = "\n".join(",".join(line.split()) for line in text.splitlines()).replace(
    "_Gene",
    " Gene",
)
df = pd.read_csv(StringIO(csv_text), delimiter=",")

# Group by gene_label, PMID, get the sizes of "Matched Gene" column
sizes = df.groupby(by=["gene_label", "PMID"]).size()["Matched Gene"]

print(sizes)

# Create a new column by looking up each PMID in sizes.
# map() yields NaN for a PMID with no "Matched Gene" rows (sizes.loc[x] would raise KeyError).
df["matched_count"] = df["PMID"].map(sizes).fillna(0).astype(int)

print(df.to_string())

OUTPUT:

    PMID
    33377242    1
    33414971    6
    dtype: int64
       PMID gene_symbol    gene_label  gene_mentions  matched_count
0  33377242       MTHFR  Matched Gene              2              1
1  33414971       CSF3R  Matched Gene             13              6
2  33414971         BCR    Other Gene              2              6
3  33414971        ABL1  Matched Gene              1              6
4  33414971        ESR1  Matched Gene              1              6
5  33414971      NDUFB3    Other Gene              1              6
6  33414971        CSF3    Other Gene              1              6
7  33414971        TP53  Matched Gene              2              6
8  33414971         SRC  Matched Gene              1              6
9  33414971        JAK1  Matched Gene              1              6
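
The value_counts() statement from the question already computes the right numbers; it only lacks a step that broadcasts them back to every row. A minimal sketch of that idea, assuming the same PMID and gene_label values as the question (the other columns are omitted for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    "PMID": [33377242] + [33414971] * 9,
    "gene_label": [
        "Matched Gene", "Matched Gene", "Other Gene", "Matched Gene",
        "Matched Gene", "Other Gene", "Other Gene", "Matched Gene",
        "Matched Gene", "Matched Gene",
    ],
})

# Per-PMID counts of "Matched Gene" rows, indexed by PMID.
counts = df.loc[df["gene_label"] == "Matched Gene", "PMID"].value_counts()

# Look up each row's PMID in counts; PMIDs without a match get 0.
df["matched_count"] = df["PMID"].map(counts).fillna(0).astype(int)

print(df)
```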

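You can do this with a groupby transform. Since you are looking for the specific value 'Matched Gene', you need to filter on it first and then group. After that you can fill in the value for the remaining rows.

This will do: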

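# Count only the 'Matched Gene' rows per PMID; the other rows become NaN on index alignment
df['matched_counts'] = df[df['gene_label']=='Matched Gene'].groupby(['PMID'])['gene_label'].transform('count')
# Fill the NaN rows within each PMID (a plain ffill could carry a count over from the previous PMID)
df['matched_counts'] = df.groupby('PMID')['matched_counts'].transform('max').fillna(0).astype(int)
print(df)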

The output will be:

       PMID gene_symbol    gene_label  gene_mentions  matched_counts
0  33377242       MTHFR  Matched Gene              2               1
1  33414971       CSF3R  Matched Gene             13               6
2  33414971         BCR    Other Gene              2               6
3  33414971        ABL1  Matched Gene              1               6
4  33414971        ESR1  Matched Gene              1               6
5  33414971      NDUFB3    Other Gene              1               6
6  33414971        CSF3    Other Gene              1               6
7  33414971        TP53  Matched Gene              2               6
8  33414971         SRC  Matched Gene              1               6
9  33414971        JAK1  Matched Gene              1               6

Alternatively, you can also do this:

df['matched_counts'] = df.groupby('PMID')['gene_label'].transform(lambda x: sum(x == 'Matched Gene'))

This one-liner does the same thing and gives the result shown above.
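
The same per-group count can also be written without a Python lambda, which may be faster on large frames. The sketch below is one possible variant, not part of the original answer; the sample PMIDs 1, 2 and 3 are made up so that one group has no match and another starts with an "Other Gene" row:

```python
import pandas as pd

# Hypothetical data covering the edge cases.
df = pd.DataFrame({
    "PMID": [1, 1, 1, 2, 3, 3],
    "gene_label": [
        "Matched Gene", "Other Gene", "Matched Gene",
        "Other Gene", "Other Gene", "Matched Gene",
    ],
})

# 1 where the label matches, 0 elsewhere.
is_matched = df["gene_label"].eq("Matched Gene").astype(int)

# Sum the flags per PMID and broadcast the total back to every row.
df["matched_counts"] = is_matched.groupby(df["PMID"]).transform("sum")

print(df)
```

Because every row takes part in the grouping, no fill step is needed afterwards.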

