
How to count the instances of unique values in Python Dataframe

I have a dataframe like the one below, with 2 million rows. The sample data can be found here.

[Image: sample of the dataframe]

The list of matches in every row can contain any numbers between 1 and 761. I want to count the occurrences of every number between 1 and 761 across the whole matches column. For example, the result for the above data will be:

[Image: expected output]

If a particular id is not found, then its count will be 0 in the output. I tried a for-loop approach, but it is quite slow.

import pandas as pd

def readData():
    # file_path is defined elsewhere
    df = pd.read_excel(file_path)

    # One counter per pattern id (1..761), stored at index id - 1
    pattern_match_count = [0] * 761
    for index, row in df.iterrows():
        matches = row["matches"]

        for pattern_id in range(1, 762):
            if pattern_id in matches:
                pattern_match_count[pattern_id - 1] += 1

    return pattern_match_count

Is there a better approach with pandas that would make the implementation faster?

You can use the .explode() method to "explode" the lists into new rows, one row per list element, and then count the values.

def readData():
    df = pd.read_excel(file_path)
    # The lists are in the "matches" column
    return df.loc[:, "matches"].explode().value_counts()
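Note that value_counts() only reports ids that actually occur, so ids with no matches are missing rather than 0. A minimal sketch of one way to fill them in, assuming the lists live in a column named matches as in the question; the helper name count_matches and the small sample dataframe are made up for illustration:

```python
import pandas as pd

def count_matches(df, max_id=761):
    """Count how often each id in 1..max_id occurs in df["matches"]."""
    return (
        df["matches"]
        .explode()        # one row per list element
        .dropna()         # empty lists explode to NaN; drop them
        .astype(int)
        .value_counts()
        # Add the ids that never occur, with a count of 0, in id order
        .reindex(range(1, max_id + 1), fill_value=0)
    )

# Hypothetical sample data standing in for the real 2-million-row frame
sample = pd.DataFrame({"matches": [[1, 2, 3], [1, 3, 3, 4]]})
counts = count_matches(sample)
```

The result is a Series indexed by id, so counts.loc[3] gives the count for id 3 and ids that never appear report 0.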

You can use collections.Counter:

import pandas as pd
from collections import Counter

df = pd.DataFrame({"matches": [[1,2,3],[1,3,3,4]]})

#df:
#        matches
#0     [1, 2, 3]
#1  [1, 3, 3, 4]

# Flatten the lists and count every value
C = Counter([i for sl in df.matches for i in sl])
#C:
#Counter({3: 3, 1: 2, 2: 1, 4: 1})

pd.DataFrame(C.items(), columns=["match_id", "counts"])
#   match_id  counts
#0         1       2
#1         2       1
#2         3       3
#3         4       1

If you want zeros for match_ids that aren't in any of the matches, then you can update the Counter object C:

for i in range(1,762):
    if i not in C:
        C[i] = 0
pd.DataFrame(C.items(), columns=["match_id", "counts"]) 
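With the loop above, the ids that were missing are appended after the ones that were found, so the rows are not in id order. Since a Counter returns 0 for any key it has not seen, a possible variant is to look up every id in order instead. This is only a sketch on the same toy data; the names ids and result are illustrative:

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Same toy data as in the example above
df = pd.DataFrame({"matches": [[1, 2, 3], [1, 3, 3, 4]]})

# Flatten all lists lazily and count every value
C = Counter(chain.from_iterable(df["matches"]))

# C[i] is 0 for ids that never occur, so no explicit update is needed
ids = range(1, 762)
result = pd.DataFrame({"match_id": list(ids), "counts": [C[i] for i in ids]})
```

This yields one row per id from 1 to 761, sorted by match_id.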
