How can I optimize this Python code? I need to improve its run time

Question

I want to optimize this filter function. It searches two lists, one of categories and one of tags, which is why it takes a long time to run.

def get_percentage(l1, l2, sim_score):
    diff = intersection(l1, l2)
    size = len(l1)
    if size != 0:
        perc = (diff/size)
        if perc >= sim_score:
            return True
    else:
        return False

def intersection(lst1, lst2):
    return len(list(set(lst1) & set(lst2)))

def filter_entities(country, city, category, entities, entityId):
    valid_entities = []
    tags = get_tags(entities, entityId)
    for index, i in entities.iterrows():
        if i["country"] == country and i["city"] == city:
            for j in i.categories:
                if j == category:
                    if(get_percentage(i["tags"], tags, 0.80)):
                        valid_entities.append(i.entity_id)

    return valid_entities

Answer 1

You have a couple of unnecessary for loops and if checks in there that you can remove, and you should definitely take advantage of df.loc for selecting elements from your dataframe (assuming entities is a Pandas dataframe):

def get_percentage(l1, l2, sim_score):
    if len(l1) == 0:
        return False  # shortcut this default case
    else:
        diff = intersection(l1, l2)
        perc = (diff / len(l1))
        return perc >= sim_score  # rather than handling each case separately

def intersection(lst1, lst2):
    return len(set(lst1).intersection(lst2))  # almost twice as fast this way on my machine

def filter_entities(country, city, category, entities, entityId):
    valid_entities = []
    tags = get_tags(entities, entityId)
    # Just grab the desired elements directly, no loops
    entity = entities.loc[(entities.country == country) &
                          (entities.city == city)]
    if category in entity.categories and get_percentage(entity.tags, tags, 0.8):
        valid_entities.append(entity.entity_id)
    return valid_entities

It's difficult to say for sure that this will help, because we can't actually run the code you provided, but it should remove some inefficiencies and take advantage of some of the optimizations available in Pandas.

Depending on your data structure (i.e. if you have multiple matches in entity above), you may need to do something like this for the last three lines above:

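# Iterating a DataFrame directly yields column names, so walk the rows explicitly
for ent in entity.itertuples(index=False):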
    if category in ent.categories and get_percentage(ent.tags, tags, 0.8):
        valid_entities.append(ent.entity_id)
return valid_entities
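
Putting the pieces above together, here is a minimal, self-contained sketch of the multi-match variant. It is not the asker's actual code: the toy DataFrame is invented for illustration, and get_tags is a hypothetical stand-in (the question never shows it), assumed here to return the tag list of the entity with the given id.

```python
import pandas as pd


def tag_overlap_ok(candidate_tags, reference_tags, sim_score):
    """Return True if at least sim_score of candidate_tags appear in reference_tags."""
    if len(candidate_tags) == 0:
        return False
    shared = len(set(candidate_tags).intersection(reference_tags))
    return shared / len(candidate_tags) >= sim_score


def get_tags(entities, entity_id):
    # Hypothetical stand-in: the question does not show get_tags.
    # Assumed to return the tag list of the entity with this id.
    return entities.loc[entities["entity_id"] == entity_id, "tags"].iloc[0]


def filter_entities(country, city, category, entities, entity_id, sim_score=0.8):
    tags = set(get_tags(entities, entity_id))
    # The vectorized comparison narrows the frame before any Python-level work.
    subset = entities.loc[(entities["country"] == country) & (entities["city"] == city)]
    return [
        row.entity_id
        for row in subset.itertuples(index=False)
        if category in row.categories and tag_overlap_ok(row.tags, tags, sim_score)
    ]


# Invented sample data, only to make the sketch runnable.
entities = pd.DataFrame({
    "entity_id": [1, 2, 3, 4],
    "country": ["FR", "FR", "FR", "DE"],
    "city": ["Paris", "Paris", "Lyon", "Berlin"],
    "categories": [["food"], ["food", "bar"], ["food"], ["food"]],
    "tags": [["a", "b"], ["a", "c"], ["a", "b"], ["a", "b"]],
})

result = filter_entities("FR", "Paris", "food", entities, 1)
```

The country/city test runs once as a boolean mask, so the slower per-row checks on the list-valued columns only touch the rows that survive it. How much this helps depends on how selective that first filter is for your data.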

Answer 2

A first step would be to look at Engineero's answer, which removes the unnecessary if checks and for loops. Next, if you are working with large amounts of input data (which is likely the case if the function takes a noticeably long time), you may want to store the data in NumPy arrays instead of lists, since NumPy generally handles large amounts of data much better. For plain numeric work NumPy can even outperform Pandas DataFrames. After a certain point you should ask yourself whether efficiency is more important than the convenience of Pandas; if it is, NumPy will usually be quicker for large amounts of data.
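
As a rough illustration of what a NumPy-based version might look like, the sketch below encodes tags and categories as boolean matrices so the whole filter becomes array operations. All names and data here are hypothetical, and this is only one possible encoding. Note one behavioral difference: it counts distinct tags per entity, whereas the original len(l1) counts duplicates too.

```python
import numpy as np

# Invented toy data; in practice these would be extracted from the entities table.
entity_ids = np.array([1, 2, 3, 4])
countries = np.array(["FR", "FR", "FR", "DE"])
cities = np.array(["Paris", "Paris", "Lyon", "Berlin"])
categories = [["food"], ["food", "bar"], ["food"], ["food"]]
tags = [["a", "b"], ["a", "c"], ["a", "b"], ["a", "b"]]


def one_hot(rows, vocabulary):
    """Encode lists of labels as a boolean matrix of shape (len(rows), len(vocabulary))."""
    column = {label: j for j, label in enumerate(vocabulary)}
    matrix = np.zeros((len(rows), len(vocabulary)), dtype=bool)
    for i, labels in enumerate(rows):
        for label in labels:
            matrix[i, column[label]] = True
    return matrix


# One-time preprocessing cost, reused across every query.
tag_vocab = sorted({t for row in tags for t in row})
cat_vocab = sorted({c for row in categories for c in row})
tag_matrix = one_hot(tags, tag_vocab)
cat_matrix = one_hot(categories, cat_vocab)


def filter_entities(country, city, category, reference_index, sim_score=0.8):
    reference = tag_matrix[reference_index]
    shared = (tag_matrix & reference).sum(axis=1)  # tags in common with the reference
    sizes = tag_matrix.sum(axis=1)                 # distinct tags per entity
    # shared / sizes >= sim_score, rearranged to avoid dividing by zero.
    similar = (sizes > 0) & (shared >= sim_score * sizes)
    mask = (
        (countries == country)
        & (cities == city)
        & cat_matrix[:, cat_vocab.index(category)]  # raises ValueError for unknown category
        & similar
    )
    return entity_ids[mask]


result = filter_entities("FR", "Paris", "food", reference_index=0)
```

Whether this beats the Pandas version depends on the data: the boolean matrices grow with the number of distinct tags, so a very large tag vocabulary could make this encoding impractical.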

Answer 1: 1 · accepted · 2018-07-26 15:04:36
Answer 2: 1 · 2018-07-26 15:10:59
