
Tips for speeding up my python code

I have written a python program that needs to deal with quite large data sets for a machine learning task. I have a train set (about 6 million rows) and a test set (about 2 million rows). So far my program runs in a reasonable amount of time, until I get to the last part of my code. The thing is, my machine learning algorithm makes predictions, and I save those predictions into a list. But before I write my predictions to a file I need to do one thing. There are duplicates in my train and test sets. I need to find those duplicates in the train set and extract their corresponding labels. To achieve this I created a dictionary with my training examples as keys and my labels as values. Afterwards, I create a new list and iterate over my test set. If an example in my test set can be found in my train set, I append the corresponding label to my new list; otherwise, I append my prediction.

The actual code I used to achieve what I described above:

from itertools import izip

listed_predictions = list(predictions)

# create a dictionary mapping training examples to their labels
train_dict = dict(izip(train, labels))

result = []
for sample in xrange(len(listed_predictions)):
    if test[sample] in train_dict.keys():
        result.append(train_dict[test[sample]])
    else:
        result.append(predictions[sample])

This loop takes roughly 2 million iterations. I thought about numpy arrays, since those should scale better than python lists, but I have no idea how I could achieve the same with numpy arrays. I also thought about other optimization solutions like Cython, but before I dive into that, I am hoping there is low-hanging fruit that I, as an inexperienced programmer with no formal computing education, don't see.
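A rough sketch of one possible vectorized numpy approach is below. It assumes each sample can be reduced to a single sortable value such as a string, which may not hold for multi-column feature rows; the tiny arrays are only placeholders for the real data.

import numpy as np

# placeholder arrays standing in for the real train/test data
train = np.array(['a', 'b', 'c'])
labels = np.array([0, 1, 1])
test = np.array(['b', 'x', 'a'])
predictions = np.array([1, 0, 0])

# sort the training samples once so every lookup can use binary search
order = np.argsort(train)
train_sorted = train[order]
labels_sorted = labels[order]

# for each test sample, find where it would sit in the sorted train set
idx = np.searchsorted(train_sorted, test)
idx = np.clip(idx, 0, len(train_sorted) - 1)

# a test sample is "known" if the sample at that position matches exactly
known = train_sorted[idx] == test

# take the training label where known, the model's prediction otherwise
result = np.where(known, labels_sorted[idx], predictions)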

Update: I have implemented thefourtheye's solution, and it brought my runtime down to about 10 hours, which is fast enough for what I want to achieve. Everybody, thank you for your help and suggestions.

Two suggestions:

  1. To check if a key is in a dict, simply use in on the dict object (this happens in O(1)):

     if key in dict: 
  2. Use comprehensions whenever possible.

So, your code becomes like this:

result = [train_dict.get(test[sample], predictions[sample]) for sample in xrange(len(listed_predictions))]

test[sample] in train_dict.keys() is extremely inefficient. It iterates over all the keys of train_dict looking for the value, when the whole point of a dictionary is fast key lookup.

Use test[sample] in train_dict instead -- that change alone might solve your performance issues.
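As a rough illustration of the difference in Python 2 (the dictionary size and lookup key below are made up for the demo; run it as a script so the __main__ import works):

from timeit import timeit

d = dict((i, i) for i in xrange(1000000))

# d.keys() builds a fresh list of a million keys and scans it linearly per lookup
slow = timeit('999999 in d.keys()', setup='from __main__ import d', number=100)

# a plain `in` test uses the dict's hash table, so each lookup is O(1)
fast = timeit('999999 in d', setup='from __main__ import d', number=100)

print slow, fast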

Also, do you actually need results to be a list? It may or may not help performance if you just avoid creating that 2-million-entry list. How about:

def results(sample):
    item = test[sample]
    return train_dict[item] if item in train_dict else predictions[sample]

Something to compare for performance:

def results(sample):
    # advantage - only looks up the key once
    # disadvantage - accesses `predictions` whether needed or not,
    # so could be cache inefficient
    return train_dict.get(test[sample], predictions[sample])

We can try to get both advantages:

def results(sample):
    # disadvantage - goes wrong if train_dict contains any value that's false
    return train_dict.get(test[sample]) or predictions[sample]

def results(sample):
    # disadvantage - goes wrong if train_dict contains any None value
    value = train_dict.get(test[sample])
    return predictions[sample] if value is None else value

def results(sample):
    # disadvantage - exception might be slow, and might be the common case
    try:
        return train_dict[test[sample]]
    except KeyError:
        return predictions[sample]

default_value = object()
def results(sample):
    # disadvantage - kind of obscure
    value = train_dict.get(test[sample], default_value)
    return predictions[sample] if value is default_value else value

Of course, all of these functions assume that test and predictions will remain unmodified for as long as you use the results function.
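As a usage sketch, any of the results functions above can be consumed lazily, so the 2-million-entry list never needs to be built; the output filename here is only an example:

with open('final_predictions.txt', 'w') as out:
    for sample in xrange(len(test)):
        out.write('%s\n' % results(sample))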

Not sure if this will give you a performance boost, but I guess you can try it:

def look_up(x):
    try:
        return train_dict[test[x]]
    except KeyError:
        return predictions[x]

result = map(look_up, xrange(len(listed_predictions)))
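If the full list is not actually needed, itertools.imap in Python 2 is a lazy drop-in for map and would yield results one at a time instead of building them all up front:

from itertools import imap

result_iter = imap(look_up, xrange(len(listed_predictions)))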

In Python 2.7, assuming you can form dictionaries of training samples and test samples as:

from itertools import izip

dict1 = dict(izip(train_samples, labels))
dict2 = dict(izip(test_samples, predictions))

then:

result = dict(dict2.items() + [(k,v) for k,v in dict1.viewitems() if k in dict2])

This gives you a dictionary that always uses the known labels from the training set, but whose keys are limited to the samples that appear in the test set. You can get this back into a list if need be.
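For example, if test_samples preserves the original test order, the merged dictionary can be turned back into an ordered list like this (a small sketch, assuming every test sample ended up as a key in result):

ordered_result = [result[sample] for sample in test_samples]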

There may be faster implementations using Series from pandas, or numpy with where and unique.
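A minimal sketch of the pandas idea, assuming each sample can serve as an index value; the tiny lists are placeholders, and reindex plus notnull do the lookup and fallback:

import numpy as np
import pandas as pd

# placeholder data standing in for the real samples, labels and predictions
train_samples = ['a', 'b', 'c']
labels = [0, 1, 1]
test_samples = ['b', 'x', 'a', 'y']
predictions = [1, 0, 0, 1]

# map training samples to labels, dropping duplicate samples first
label_map = pd.Series(labels, index=train_samples)
label_map = label_map[~label_map.index.duplicated(keep='first')]

# look up every test sample; unknown samples come back as NaN
looked_up = label_map.reindex(test_samples)

# fall back to the model's predictions where no training label exists
result = np.where(looked_up.notnull(), looked_up, predictions)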
