How to count the instances of unique values in Python Dataframe
I have a dataframe like below with 2 million rows. The sample data can be found here.
The list of matches in every row can contain any numbers between 1 and 761. I want to count the occurrences of every number between 1 and 761 across the matches column altogether. For example, the result of the above data will be:
If a particular id is not found, then its count should be 0 in the output. I tried a for-loop approach, but it is quite slow.
def readData():
    df = pd.read_excel(file_path)
    pattern_match_count = [0] * 761
    for index, row in df.iterrows():
        matches = row["matches"]
        for pattern_id in range(1, 762):
            if pattern_id in matches:
                pattern_match_count[pattern_id - 1] += 1
    return pattern_match_count
Is there any better approach with pandas to make the implementation faster?
You can use the .explode() method to "explode" the lists into new rows.
def readData():
    df = pd.read_excel(file_path)
    return df["matches"].explode().value_counts()
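A minimal sketch of the explode-and-count approach on a small in-memory frame (standing in for the real 2-million-row file), with a reindex step added so that ids missing from every row still show up with a count of 0:

```python
import pandas as pd

# Small sample standing in for the real data.
df = pd.DataFrame({"matches": [[1, 2, 3], [1, 3, 3, 4]]})

# Explode the lists into one row per element, then count occurrences.
counts = df["matches"].explode().value_counts()

# reindex guarantees an entry (count 0) for every id from 1 to 761.
counts = counts.reindex(range(1, 762), fill_value=0)
```

The reindex step covers the "count should be 0 if an id is not found" requirement without a Python-level loop.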
You can use collections.Counter:
df = pd.DataFrame({"matches": [[1,2,3],[1,3,3,4]]})
#df:
# matches
#0 [1, 2, 3]
#1 [1, 3, 3, 4]
from collections import Counter
C = Counter([i for sl in df.matches for i in sl])
#C:
#Counter({1: 2, 2: 1, 3: 3, 4: 1})
pd.DataFrame(C.items(), columns=["match_id", "counts"])
# match_id counts
#0 1 2
#1 2 1
#2 3 3
#3 4 1
If you want zeros for match_ids that aren't in any of the matches, then you can update the Counter object C:
for i in range(1, 762):
    if i not in C:
        C[i] = 0

pd.DataFrame(C.items(), columns=["match_id", "counts"])
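An alternative sketch: instead of patching in the zeros afterwards, the Counter can be pre-seeded with every id set to 0 and then updated, since Counter.update adds to existing counts rather than replacing them:

```python
from collections import Counter
import pandas as pd

# Small sample standing in for the real data.
df = pd.DataFrame({"matches": [[1, 2, 3], [1, 3, 3, 4]]})

# Pre-seed every id with 0, then add the real counts on top.
C = Counter({i: 0 for i in range(1, 762)})
C.update(i for sl in df.matches for i in sl)

result = pd.DataFrame(sorted(C.items()), columns=["match_id", "counts"])
```

Sorting C.items() also keeps the output ordered by match_id, which the post-hoc update loop does not guarantee.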