I need some help optimizing Python code

I'm working on a KNN classifier in Python, but I have a performance problem. The following piece of code takes 7.5 to 9.0 seconds to complete, and I'll have to run it 60,000 times.

        for fold in folds:  
            for dot2 in fold:
                """
                distances[x][0] = Class of the dot2
                distances[x][1] = distance between dot1 and dot2
                """
                distances.append([dot2[0], calc_distance(dot1[1:], dot2[1:], method)])

The "folds" variable is a list of 10 folds that together contain 60,000 images loaded from a .csv file. The first value of each dot is the class it belongs to, and all values are integers. Is there a way to make this code run any faster?

Here is the calc_distance function:

    import math

    def calc_distance(dot1, dot2, distance):
        if distance == "manhanttan":
            total = 0
            # for each coord, take the absolute difference
            for x in range(0, len(dot1)):
                total = total + abs(dot1[x] - dot2[x])
            return total

        elif distance == "euclidiana":
            total = 0
            # sum the squared differences, then take the square root
            for x in range(0, len(dot1)):
                total = total + (dot1[x] - dot2[x])**2
            return math.sqrt(total)

        elif distance == "supremum":
            total = 0
            # keep the largest absolute difference over all coords
            for x in range(0, len(dot1)):
                if abs(dot1[x] - dot2[x]) > total:
                    total = abs(dot1[x] - dot2[x])
            return total

        elif distance == "cosseno":
            dist = 0
            p1_p2_mul = 0
            p1_sum = 0
            p2_sum = 0
            # dot product and squared norms in a single pass
            for x in range(0, len(dot1)):
                p1_p2_mul = p1_p2_mul + dot1[x]*dot2[x]
                p1_sum = p1_sum + dot1[x]**2
                p2_sum = p2_sum + dot2[x]**2
            p1_sum = math.sqrt(p1_sum)
            p2_sum = math.sqrt(p2_sum)
            quociente = p1_sum*p2_sum
            # note: this value is the cosine similarity (1 means same direction)
            dist = p1_p2_mul/quociente

            return dist

EDIT: Found a way to make it faster at least for the "manhanttan" method. Instead of:

    if distance == "manhanttan":
        total = 0
        # for each coord, take the absolute difference
        for x in range(0, len(dot1)):
            total = total + abs(dot1[x] - dot2[x])
        return total

I put:

    if distance == "manhanttan":
        totalp1 = 0
        totalp2 = 0
        # sum the coords of each dot separately
        for x in range(0, len(dot1)):
            totalp1 += dot1[x]
            totalp2 += dot2[x]

        return abs(totalp1-totalp2)

The abs() call seems to be very expensive.
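One caution about that rewrite: taking a single abs() of the difference of the sums is not the same computation as summing the per-coordinate absolute differences, so the two versions can return different values. A small check on made-up vectors (the function names here are illustrative, not from the original code) shows the gap:

```python
def manhattan(dot1, dot2):
    # Sum of per-coordinate absolute differences (the original loop).
    return sum(abs(a - b) for a, b in zip(dot1, dot2))


def difference_of_sums(dot1, dot2):
    # What the rewritten loop computes: one abs() over the two totals.
    return abs(sum(dot1) - sum(dot2))


p, q = [0, 10], [10, 0]
print(manhattan(p, q))           # 20
print(difference_of_sums(p, q))  # 0
```

The two only agree when every coordinate of one dot is greater than or equal to the matching coordinate of the other, so the speedup likely comes at the cost of a different result.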

There are many guides to "profiling python"; you should search for some, read them, and walk through the profiling process to ensure you know what parts of your work are taking the most time.

But if this is really the core of your work, it's a fair bet that calc_distance is where the majority of the running time is being consumed.

Optimizing that deeply will probably require using NumPy accelerated math or a similar, lower-level approach.
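As a rough sketch of what the NumPy route might look like: assuming the features of every dot (without the class column) have been copied into one 2-D array, the distances from dot1 to all dots can be computed with whole-array operations instead of a Python loop per coordinate. The name all_distances and the array layout are assumptions for illustration; the method strings are the ones from the question.

```python
import numpy as np


def all_distances(dot1, data, method):
    """Distances from dot1 to every row of data (features only, no class column)."""
    # Convert to float so integer pixel types cannot overflow on subtraction.
    dot1 = np.asarray(dot1, dtype=np.float64)
    data = np.asarray(data, dtype=np.float64)
    diff = data - dot1  # broadcasting: dot1 is subtracted from each row

    if method == "manhanttan":
        return np.abs(diff).sum(axis=1)
    if method == "euclidiana":
        return np.sqrt((diff ** 2).sum(axis=1))
    if method == "supremum":
        return np.abs(diff).max(axis=1)
    if method == "cosseno":
        # Same quantity as in the question: dot product over product of norms.
        norms = np.linalg.norm(data, axis=1) * np.linalg.norm(dot1)
        return (data @ dot1) / norms
    raise ValueError("unknown method: " + method)
```

The class labels would then live in a separate 1-D array with the same row order, so that the k smallest distances can be mapped back to classes, for example with np.argsort.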

As a quick and dirty approach requiring less invasive profiling and rewriting, try installing the PyPy implementation of Python and running under it. I have seen easy 2x or more accelerations compared to the standard (CPython) implementation.

I'm confused. Did you try the profiler?

    python -m cProfile myscript.py

It will show you where the bulk of the time is being consumed and give you hard data to work with. From there you can, for example, refactor to reduce the number of calls, restructure the input data, or substitute one function for another.

https://docs.python.org/3/library/profile.html
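If profiling the whole script is too coarse, the same module can also be driven from inside the program. A minimal sketch, where profile_call and slow_sum are hypothetical helper names used only for this example:

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Collect the ten most expensive entries, sorted by cumulative time.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


def slow_sum(n):
    # Stand-in for the real work, e.g. one pass over all folds.
    total = 0
    for i in range(n):
        total += abs(i - n)
    return total


if __name__ == "__main__":
    value, report = profile_call(slow_sum, 100000)
    print(report)
```

In the report, the rows with the highest cumulative time are the ones worth optimizing first.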

In the first place, you should avoid using a single calc_distance function that walks through a chain of string comparisons on every call. Define independent distance functions and call the right one. As Lee Daniel Crocker suggested, don't use slicing; just start your loop ranges at 1.
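One way this could be arranged, as a sketch rather than a definitive layout: each distance becomes its own function that skips index 0 (the class), and the method name is looked up once before the loop. The names DISTANCES and all_distances are made up for this example; the method strings are the ones from the question.

```python
import math


def manhattan(dot1, dot2):
    total = 0
    for x in range(1, len(dot1)):  # index 0 holds the class, so start at 1
        total += abs(dot1[x] - dot2[x])
    return total


def euclidean(dot1, dot2):
    total = 0
    for x in range(1, len(dot1)):
        total += (dot1[x] - dot2[x]) ** 2
    return math.sqrt(total)


def supremum(dot1, dot2):
    total = 0
    for x in range(1, len(dot1)):
        d = abs(dot1[x] - dot2[x])
        if d > total:
            total = d
    return total


# Map the method names used in the question to the functions above.
DISTANCES = {
    "manhanttan": manhattan,
    "euclidiana": euclidean,
    "supremum": supremum,
}


def all_distances(dot1, folds, method):
    dist = DISTANCES[method]  # resolved once, not once per pair of dots
    # Full dots are passed in, so no slices are created inside the loop.
    return [[dot2[0], dist(dot1, dot2)] for fold in folds for dot2 in fold]
```

The inner loop then receives the full dots, so no list copies are created by slicing.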

For the cosine distance, I would recommend normalizing all the dot vectors once and for all. This way, the distance computation reduces to a dot product.
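A minimal sketch of that idea, assuming each dot is a list whose first entry is the class; normalize and cosine_of_normalized are illustrative names, not part of the original code:

```python
import math


def normalize(dot):
    """Return a copy of dot with the features (index 1 onward) scaled to unit length."""
    norm = math.sqrt(sum(dot[x] ** 2 for x in range(1, len(dot))))
    if norm == 0:
        return list(dot)  # an all-zero vector has no direction; leave it unchanged
    return [dot[0]] + [dot[x] / norm for x in range(1, len(dot))]


def cosine_of_normalized(dot1, dot2):
    """Cosine between two already-normalized dots: just a dot product."""
    return sum(dot1[x] * dot2[x] for x in range(1, len(dot1)))
```

The normalization would be run once over all 60,000 dots before the main loop, so the two square roots and the division disappear from every pairwise computation.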

These micro-optimizations can give you some speedup. But a better gain should be possible by switching to a better algorithm: the kNN classifier calls for a k-d tree, which will allow you to quickly remove a significant fraction of the points from consideration.

This is harder to implement (you'll have to adapt it slightly for the different distances, and the cosine distance will make it tricky).
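To give an idea of the shape of such a structure, here is a small pure-Python sketch of a k-d tree with a single-nearest-neighbour search. It is an illustration under stated assumptions, not a tuned implementation: points are (label, coords) pairs, all function names are made up, and only one neighbour is returned rather than k. The pruning test relies on the fact that the Manhattan, Euclidean and supremum distances are all at least as large as the difference along any single coordinate.

```python
import math


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def supremum(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def build_kdtree(points, depth=0):
    """points: list of (label, coords). Returns nested tuples, or None if empty."""
    if not points:
        return None
    axis = depth % len(points[0][1])  # cycle through the coordinates
    points = sorted(points, key=lambda p: p[1][axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def nearest(node, query, dist, best=None):
    """Return (distance, label) of the stored point closest to query."""
    if node is None:
        return best
    (label, coords), axis, left, right = node
    d = dist(query, coords)
    if best is None or d < best[0]:
        best = (d, label)
    diff = query[axis] - coords[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, dist, best)
    # Every point on the far side is at least |diff| away under all three
    # metrics, so that subtree is skipped when it cannot beat the best so far.
    if abs(diff) < best[0]:
        best = nearest(far, query, dist, best)
    return best
```

The cosine case does not fit this pruning test directly. One possible workaround, not taken from the answer above, is to normalize the vectors first: for unit-length vectors the squared Euclidean distance equals 2 - 2*cos, so the Euclidean nearest neighbour is also the one with the highest cosine.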
