Group by a column and count string in another column in Python Pandas dataframe

I have the following dataframe as the output of my Python script. I would like to add another column with the count, per PMID, of the rows where gene_label is Matched Gene.

The dataframe looks like this:

df

       PMID gene_symbol    gene_label gene_mentions
0  33377242       MTHFR  Matched Gene             2
1  33414971       CSF3R  Matched Gene            13
2  33414971         BCR    Other Gene             2
3  33414971        ABL1  Matched Gene             1
4  33414971        ESR1  Matched Gene             1
5  33414971      NDUFB3    Other Gene             1
6  33414971        CSF3    Other Gene             1
7  33414971        TP53  Matched Gene             2
8  33414971         SRC  Matched Gene             1
9  33414971        JAK1  Matched Gene             1

The expected output is:

       PMID gene_symbol    gene_label gene_mentions   matched_count
0  33377242       MTHFR  Matched Gene             2   1
1  33414971       CSF3R  Matched Gene            13   6
2  33414971         BCR    Other Gene             2   6
3  33414971        ABL1  Matched Gene             1   6
4  33414971        ESR1  Matched Gene             1   6
5  33414971      NDUFB3    Other Gene             1   6
6  33414971        CSF3    Other Gene             1   6
7  33414971        TP53  Matched Gene             2   6
8  33414971         SRC  Matched Gene             1   6
9  33414971        JAK1  Matched Gene             1   6

I have used the following statement, but it does not take the other rows into account.

df.loc[df.gene_label == 'Matched Gene', 'PMID'].value_counts()

First filter to keep only the "Matched Gene" labels, aggregate and count by pmid and gene_label, and then join the counts back to the original dataframe.

import pandas as pd

# Setup
pmid = [33377242] + [33414971] * 9
gene_symbol = ["MTHFR", "CSF3R", "BCR", "ABL1", "ESR1", "NDUFB3", "CSF3", "TP53", "SRC", "JAK1"]
gene_label = ["Matched Gene", "Matched Gene", "Other Gene", "Matched Gene", "Matched Gene", "Other Gene", "Other Gene", "Matched Gene", "Matched Gene", "Matched Gene"]
gene_mentions = [2, 13, 2, 1, 1, 1, 1, 2, 1, 1]

df = pd.DataFrame({"pmid": pmid, "gene_symbol": gene_symbol, "gene_label": gene_label, "gene_mentions": gene_mentions})

# Keep matched gene only
filter_df = df[df["gene_label"] == "Matched Gene"]
# Aggregate and count
agg_df = filter_df.groupby(["pmid", "gene_label"], as_index=False).agg(matched_count=("gene_label", "count"))
# Add count back to original dataframe by merging
df = df.merge(agg_df[["pmid", "matched_count"]], on="pmid")

Output:

       pmid gene_symbol    gene_label  gene_mentions  matched_count
0  33377242       MTHFR  Matched Gene              2              1
1  33414971       CSF3R  Matched Gene             13              6
2  33414971         BCR    Other Gene              2              6
3  33414971        ABL1  Matched Gene              1              6
4  33414971        ESR1  Matched Gene              1              6
5  33414971      NDUFB3    Other Gene              1              6
6  33414971        CSF3    Other Gene              1              6
7  33414971        TP53  Matched Gene              2              6
8  33414971         SRC  Matched Gene              1              6
9  33414971        JAK1  Matched Gene              1              6
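
One thing to watch with the merge approach: merge defaults to how="inner", so a pmid with no "Matched Gene" rows is missing from agg_df and would drop out of the result. The sketch below uses a small hypothetical dataframe (pmid 111 and 222 are made up for illustration) to show the difference, and one possible way to keep those rows with a count of 0.

```python
import pandas as pd

# Hypothetical data: pmid 111 has no "Matched Gene" rows at all.
df = pd.DataFrame({
    "pmid": [111, 222, 222, 222],
    "gene_symbol": ["BCR", "CSF3R", "ABL1", "CSF3"],
    "gene_label": ["Other Gene", "Matched Gene", "Matched Gene", "Other Gene"],
})

# Same filter-and-count step as in the answer above.
matched = df[df["gene_label"] == "Matched Gene"]
agg_df = matched.groupby("pmid", as_index=False).agg(matched_count=("gene_label", "count"))

# Default inner merge: the pmid 111 row disappears from the result.
inner = df.merge(agg_df, on="pmid")

# Left merge keeps every original row; missing counts are NaN, filled with 0 here.
left = df.merge(agg_df, on="pmid", how="left")
left["matched_count"] = left["matched_count"].fillna(0).astype(int)

print(len(inner))
print(left["matched_count"].tolist())
```

Whether a count of 0 or a dropped row is the right outcome depends on what the downstream analysis expects.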

from io import StringIO

import pandas as pd

# Recreate df from posted question.
text = """PMID gene_symbol    gene_label gene_mentions
0  33377242       MTHFR  Matched Gene             2
1  33414971       CSF3R  Matched Gene            13
2  33414971         BCR    Other Gene             2
3  33414971        ABL1  Matched Gene             1
4  33414971        ESR1  Matched Gene             1
5  33414971      NDUFB3    Other Gene             1
6  33414971        CSF3    Other Gene             1
7  33414971        TP53  Matched Gene             2
8  33414971         SRC  Matched Gene             1
9  33414971        JAK1  Matched Gene             1""".replace(
    " Gene", "_Gene"
)
csv_text = "\n".join(",".join(line.split()) for line in text.splitlines()).replace(
    "_Gene",
    " Gene",
)
df = pd.read_csv(StringIO(csv_text), delimiter=",")

# Group by gene_label and PMID, then take the per-PMID sizes for "Matched Gene".
sizes = df.groupby(by=["gene_label", "PMID"]).size()["Matched Gene"]

print(sizes)

# Create a new column by looking up each row's PMID in the sizes Series.
df["matched_count"] = df["PMID"].map(sizes)

print(df.to_string())

OUTPUT:

    PMID
    33377242    1
    33414971    6
    dtype: int64
       PMID gene_symbol    gene_label  gene_mentions  matched_count
0  33377242       MTHFR  Matched Gene              2              1
1  33414971       CSF3R  Matched Gene             13              6
2  33414971         BCR    Other Gene              2              6
3  33414971        ABL1  Matched Gene              1              6
4  33414971        ESR1  Matched Gene              1              6
5  33414971      NDUFB3    Other Gene              1              6
6  33414971        CSF3    Other Gene              1              6
7  33414971        TP53  Matched Gene              2              6
8  33414971         SRC  Matched Gene              1              6
9  33414971        JAK1  Matched Gene              1              6

You can use groupby with transform to get this done. Since you are looking for the specific value 'Matched Gene', you need to filter for it and then do the groupby. Then you can ffill the value into the rows that were filtered out. Note that ffill relies on row order: it works here because each PMID's rows are contiguous and start with a 'Matched Gene' row.

This will do:

df['matched_counts'] = df[df['gene_label']=='Matched Gene'].groupby(['PMID'])['gene_label'].transform('count')
df['matched_counts'] = df['matched_counts'].ffill().astype(int)
print(df)

The output will be:

       PMID gene_symbol    gene_label  gene_mentions  matched_counts
0  33377242       MTHFR  Matched Gene              2               1
1  33414971       CSF3R  Matched Gene             13               6
2  33414971         BCR    Other Gene              2               6
3  33414971        ABL1  Matched Gene              1               6
4  33414971        ESR1  Matched Gene              1               6
5  33414971      NDUFB3    Other Gene              1               6
6  33414971        CSF3    Other Gene              1               6
7  33414971        TP53  Matched Gene              2               6
8  33414971         SRC  Matched Gene              1               6
9  33414971        JAK1  Matched Gene              1               6
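
To illustrate how the ffill step depends on row order, here is a minimal sketch with hypothetical data (PMID 1 and 2 are made up) in which a PMID group starts with an "Other Gene" row. It also shows one order-independent option that reuses the value_counts statement from the question and maps the counts back by PMID. The column name ffill_count is only for this comparison.

```python
import pandas as pd

# Hypothetical data: PMID 2 starts with an "Other Gene" row.
df = pd.DataFrame({
    "PMID": [1, 1, 1, 2, 2],
    "gene_label": ["Matched Gene", "Matched Gene", "Matched Gene",
                   "Other Gene", "Matched Gene"],
})

is_matched = df["gene_label"] == "Matched Gene"

# The filter + transform + ffill approach: the "Other Gene" row of PMID 2
# inherits the count of the row above it, which belongs to PMID 1.
df["ffill_count"] = df[is_matched].groupby("PMID")["gene_label"].transform("count")
df["ffill_count"] = df["ffill_count"].ffill().astype(int)

# Order-independent: count matched rows per PMID, then look up each row's PMID.
counts = df.loc[is_matched, "PMID"].value_counts()
df["matched_counts"] = df["PMID"].map(counts).fillna(0).astype(int)

print(df)
```

In this sketch ffill_count is 3 for the first row of PMID 2, while matched_counts gives 1 for both PMID 2 rows.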

Alternatively, you can also do this:

df['matched_counts'] = df.groupby('PMID')['gene_label'].transform(lambda x: sum(x == 'Matched Gene'))

This single line does the same thing and gives the result shown above.
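
The same idea can also be written without a Python lambda, which may be faster on large frames. The sketch below is one possible variant, not part of the original answer: it builds a 0/1 mask first and sums it per PMID. The data is a trimmed sample of the question's rows.

```python
import pandas as pd

# Trimmed sample of the question's data.
df = pd.DataFrame({
    "PMID": [33377242, 33414971, 33414971, 33414971],
    "gene_label": ["Matched Gene", "Matched Gene", "Other Gene", "Matched Gene"],
})

# 1 where the label matches, 0 otherwise; summing per PMID counts the matches.
is_matched = df["gene_label"].eq("Matched Gene").astype(int)
df["matched_counts"] = is_matched.groupby(df["PMID"]).transform("sum")

print(df)
```

Because every row takes part in the groupby, a PMID with no matched genes gets 0 and no fill step is needed.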
