
Tips for speeding up my Python code

I have written a Python program that needs to deal with fairly large data sets for a machine learning task. I have a train set (about 6 million rows) and a test set (about 2 million rows). So far my program runs in a reasonable amount of time, until I get to the last part of my code. My machine learning algorithm makes predictions, and I save those predictions into a list. But before I write my predictions to a file I need to do one thing: there are duplicates between my train and test sets, and I need to find those duplicates in the train set and extract their corresponding labels. To achieve this I created a dictionary with my training examples as keys and my labels as values. Then I create a new list and iterate over my test set: if an example in my test set can be found in my train set, I append the corresponding label to my new list; otherwise, I append my prediction.

The actual code I used to achieve what I described above:

from itertools import izip

listed_predictions = list(predictions)

# create a dictionary mapping training examples to their labels
train_dict = dict(izip(train, labels))

result = []
for sample in xrange(len(listed_predictions)):
    if test[sample] in train_dict.keys():
        result.append(train_dict[test[sample]])
    else:
        result.append(predictions[sample])

This loop takes roughly 2 million iterations. I thought about numpy arrays, since those should scale better than Python lists, but I have no idea how I could achieve the same thing with numpy arrays. I also thought about other optimization options like Cython, but before I dive into that, I am hoping there is low-hanging fruit that I, as an inexperienced programmer with no formal computing education, don't see.

Update: I have implemented thefourtheye's solution, and it brought my runtime down to about 10 hours, which is fast enough for what I want to achieve. Everybody, thank you for your help and suggestions.

Two suggestions:

  1. To check whether a key is in a dict, simply use the in operator on the dict itself (this lookup happens in O(1) on average):

     if key in some_dict: 
  2. Use comprehensions whenever possible.

So your code becomes:

result = [train_dict.get(test[sample], predictions[sample]) for sample in xrange(len(listed_predictions))]
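
A small refinement (my own sketch, not part of the original suggestion): since the loop body only uses sample to index into test and predictions, you can zip the two sequences and avoid the indexing entirely:

from itertools import izip

# pair each test example with its prediction; take the known training
# label when the example was seen in training, otherwise keep the prediction
result = [train_dict.get(example, prediction)
          for example, prediction in izip(test, predictions)]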

test[sample] in train_dict.keys() is extremely inefficient. In Python 2, train_dict.keys() builds a list of all the keys, and the in test then scans that list linearly, when the whole point of dictionaries is fast hashed key lookup.

Use test[sample] in train_dict instead -- that change alone might solve your performance issues.
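
To see the size of the difference (a hypothetical micro-benchmark, not from the original answer), you can time both membership tests against a large dict:

import timeit

setup = "d = {i: i for i in xrange(1000000)}"

# O(1) hashed lookup on the dict itself
print timeit.timeit("999999 in d", setup=setup, number=100)
# builds a list of all keys and scans it linearly (Python 2)
print timeit.timeit("999999 in d.keys()", setup=setup, number=100)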

Also, do you actually need results to be a list? It may or may not help performance to simply avoid creating that 2-million-entry list. How about:

def results(sample):
    item = test[sample]
    return train_dict[item] if item in train_dict else predictions[sample]

Something to compare for performance:

def results(sample):
    # advantage - only looks up the key once
    # disadvantage - accesses `predictions` whether needed or not,
    # so could be cache inefficient
    return train_dict.get(test[sample], predictions[sample])

We can try to get both advantages:

def results(sample):
    # disadvantage - goes wrong if train_dict contains any value that's false
    return train_dict.get(test[sample]) or predictions[sample]

def results(sample):
    # disadvantage - goes wrong if train_dict contains any None value
    value = train_dict.get(test[sample])
    return predictions[sample] if value is None else value

def results(sample):
    # disadvantage - exception might be slow, and might be the common case
    try:
        return train_dict[test[sample]]
    except KeyError:
        return predictions[sample]

default_value = object()
def results(sample):
    # disadvantage - kind of obscure
    value = train_dict.get(test[sample], default_value)
    return predictions[sample] if value is default_value else value

Of course, all of these functions assume that test and predictions will remain unmodified for as long as you use the results function.
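
For instance (my own sketch, assuming the end goal is writing the final predictions to a file; the file name here is made up), you can consume results lazily instead of materializing the whole list:

# stream the merged labels/predictions straight to disk, no big list needed
with open("final_predictions.txt", "w") as out:
    for sample in xrange(len(test)):
        out.write("%s\n" % results(sample))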

Not sure if this will give you a performance boost, but you can try it:

def look_up(x):
    try:
        return train_dict[test[x]]
    except KeyError:
        return predictions[x]

result = map(look_up, xrange(len(listed_predictions)))
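
If you don't need the whole list at once (my own aside, in the same spirit as the previous answer), itertools.imap computes the same mapping lazily in Python 2; handle_value below is a hypothetical consumer:

from itertools import imap

# values are computed one at a time as you iterate
for value in imap(look_up, xrange(len(listed_predictions))):
    handle_value(value)  # hypothetical consumer, e.g. writing to a file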

In Python 2.7, assuming you can form dictionaries of training samples and test samples as:

from itertools import izip

dict1 = dict(izip(train_samples, labels))
dict2 = dict(izip(test_samples, predictions))

then:

result = dict(dict2.items() + [(k, v) for k, v in dict1.viewitems() if k in dict2])

This gives you a dictionary that always uses the known labels from the training set but is limited in range to only the samples from the test set. You can get this back into a list if need be.
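
For example (my own sketch; final is a made-up name, and this assumes the merged dict covers every test sample, which it does by construction here):

# look up the merged label/prediction for each test sample, in order
final = [result[sample] for sample in test_samples]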

There may be faster implementations using a Series from pandas, or numpy with where and unique.
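
As one possible shape of that idea (my own untested sketch, not a definitive implementation; it reuses train_dict from the question and assumes the labels are never NaN, since NaN is used as the "missing" marker here):

import pandas as pd

test_s = pd.Series(list(test))
pred_s = pd.Series(list(predictions))

# map each test sample to its training label (NaN when unseen in training),
# then fall back to the model's prediction for the unseen samples
mapped = test_s.map(train_dict)
result = mapped.where(mapped.notnull(), pred_s).tolist()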
