
Optimising counting occurrences in a list

I need to find the most popular pub names in a .csv file. The way I'm doing it now is by going through the list of pub names to check whether each name is already there; if it is, I add one to a secondary value, and if not, I append it. i.e.

pub_names = [["Inn",1]]
add "Inn"
pub_names = [["Inn",2]]
add "Pub"
pub_names = [["Inn",2],["Pub",1]]

(I'll sort them by size later.)

The problem is that this is incredibly slow, as I have 50,000 entries, and I was wondering if there's a way to optimise it. For the second entry it checks only 1 existing name to see if it's a repeat, but for the 20,000th it checks 19,999, for the 20,001st it checks 20,000, and so on.

import csv

data = list(csv.reader(open("open_pubs.csv")))
iterdata = iter(data)
next(iterdata)  # skip the header row
pub_names = []
for x in iterdata:
    for i in pub_names:
        if x[1] == i[0]:
            i[1] += 1

            full_percent = (data.index(x) / len(data)) * 100
            sub_percent = (pub_names.index(i) / len(pub_names)) * 100
            print("Total =", str(full_percent) + "%", "Sub =", str(sub_percent) + "%")
            # without this break, the for/else below would append a
            # duplicate entry for every name that was already counted
            break
    else:
        pub_names += [[x[1], 1]]

CSV file: https://www.kaggle.com/rtatman/every-pub-in-england#open_pubs.csv

Dictionaries provide much faster element access, and cleaner code in general:

pubs = {
    "Inn": 2,
    "Pub": 1
}

pubname = "Tavern"
if pubname in pubs:
    pubs[pubname] += 1
else:
    pubs[pubname] = 1
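Applied to the question's file, the same idea can be sketched with `collections.Counter`, which is a dictionary specialised for counting. This assumes, as in the question's code, that the pub name is in column index 1; check the actual header row.

```python
import csv
from collections import Counter

def count_pub_names(path):
    """Count occurrences of each pub name in the CSV at `path`.

    Assumes the name is in column index 1, as in the question's code.
    """
    counts = Counter()
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            counts[row[1]] += 1
    return counts
```

`counts.most_common(10)` then returns the ten most popular names already sorted, so no separate sorting pass is needed.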

What you can do is load it into a DataFrame and then do a groupby + count.

This loads all the data at once, then counts the number of occurrences.

import pandas as pd

df = pd.read_csv('path_to_csv')
df2 = df.groupby('Inn Name')['Inn Name'].count()

This will be faster than any loop, since DataFrame methods are vectorized.
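For this particular task, `value_counts` is an even shorter route: it groups, counts, and sorts in descending order in one call, so the most popular names come first. A minimal sketch, using a small stand-in DataFrame in place of `pd.read_csv(...)` and an assumed column name `name`:

```python
import pandas as pd

# Stand-in for df = pd.read_csv('path_to_csv'); 'name' is an assumed
# column header - check the real file's header row.
df = pd.DataFrame({"name": ["Inn", "Inn", "Pub"]})

# value_counts returns counts sorted in descending order,
# so the most popular pub name is first.
top = df["name"].value_counts()
```

`top.head(10)` would then give the ten most popular names directly.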
