
Better Python code to reduce memory usage?

I have a data frame of about 19 million rows, of which 4 variables are latitudes & longitudes. I created a function to calculate the distance between pairs of latitudes & longitudes with the help of the Python haversine package.

from haversine import haversine_vector, Unit

# function to calculate the distance between 2 sets of coordinates
def measure_distance(lat_1, long_1, lat_2, long_2):

    coordinate_start = list(zip(lat_1, long_1))
    coordinate_end = list(zip(lat_2, long_2))

    distance = haversine_vector(coordinate_start, coordinate_end, Unit.KILOMETERS)

    return distance

I use the magic command %%memit to measure the memory usage of the calculation. On average, memory usage is between 8 and 10 GB. I run my work on Google Colab, which has 12 GB of RAM; as a result, the operation sometimes hits the runtime limit and the runtime restarts.

%%memit

measure_distance(df.station_latitude_start.values, 
                 df.station_longitude_start.values, 
                 df.station_latitude_end.values, 
                 df.station_longitude_end.values)

peak memory: 7981.16 MiB, increment: 5312.66 MiB
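(For completeness: %%memit is the cell magic from the memory_profiler package, so it has to be installed and loaded in the Colab notebook first, roughly like this:)

!pip install memory_profiler
%load_ext memory_profiler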

Is there a way to optimise my code?

TL;DR: use Numpy and compute the result in chunks.

The amount of memory taken by the CPython interpreter is expected given the large input size.

Indeed, CPython stores values in lists using references. On a 64-bit system, a reference takes 8 bytes and basic types (floats and small integers) usually take 32 bytes. A tuple of two floats is a compound object that contains the size of the tuple as well as references to the two floats (not the values themselves); its size should be close to 64 bytes. Since you have 2 lists containing 19 million (references to) float pairs and 4 lists containing 19 million (references to) floats, the resulting memory taken should be about 4*19e6*(8+32) + 2*19e6*(8+64) = 5.7 GB. Not to mention that haversine can make some internal copies and the result takes some space too.
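You can check these per-object overheads yourself with sys.getsizeof; the numbers below are a rough sketch for a 64-bit CPython build and do not include the extra rounding added by the memory allocator:

import sys

# Size of a single float object (about 24-32 bytes on 64-bit CPython).
print(sys.getsizeof(1.0))

# Size of a tuple of two floats (about 56-64 bytes), counting only the
# tuple header and the two references, not the float objects themselves.
print(sys.getsizeof((1.0, 2.0)))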

If you want to reduce the memory usage, then use Numpy. Indeed, float Numpy arrays store values in a much more compact way (no references, no per-element type tag). You can replace each list of tuples with an N x 2 Numpy 2D array. The resulting size should be about 4*19e6*8 + 2*19e6*(8*2) = 1.2 GB. Moreover, the computation will be much faster because haversine uses Numpy internally. Here is an example:

import numpy as np
from haversine import haversine_vector, Unit

# Assume lat_1, long_1, lat_2 and long_2 are Numpy arrays.
# Use np.array(yourList) if you need to convert them.
def measure_distance(lat_1, long_1, lat_2, long_2):
    # Build N x 2 arrays instead of lists of tuples: no per-element Python objects.
    coordinate_start = np.column_stack((lat_1, long_1))
    coordinate_end = np.column_stack((lat_2, long_2))
    return haversine_vector(coordinate_start, coordinate_end, Unit.KILOMETERS)

The above code is about 25 times faster.


If you want to reduce the memory usage even more, you can compute the distances in chunks (for example 32K values at a time) and concatenate the output chunks. You can also use single-precision numbers rather than double precision if you do not care too much about the accuracy of the computed distances.

Here is an example of how to compute the result by chunk:

def better_measure_distance(lat_1, long_1, lat_2, long_2):
    chunkSize = 65536
    result = np.zeros(len(lat_1))
    # Process the coordinates chunk by chunk so only small temporary arrays are allocated.
    for i in range(0, len(lat_1), chunkSize):
        coordinate_start = np.column_stack((lat_1[i:i+chunkSize], long_1[i:i+chunkSize]))
        coordinate_end = np.column_stack((lat_2[i:i+chunkSize], long_2[i:i+chunkSize]))
        result[i:i+chunkSize] = haversine_vector(coordinate_start, coordinate_end, Unit.KILOMETERS)
    return result
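For the single-precision variant mentioned above, a minimal sketch (reusing the column names from the question, and assuming haversine accepts float32 arrays) is to cast the inputs to float32 before calling the chunked function:

# Cast the coordinate columns to single precision: this halves the size of the
# input arrays and of the temporary chunks, at the cost of some accuracy.
lat_1 = df.station_latitude_start.to_numpy(dtype=np.float32)
long_1 = df.station_longitude_start.to_numpy(dtype=np.float32)
lat_2 = df.station_latitude_end.to_numpy(dtype=np.float32)
long_2 = df.station_longitude_end.to_numpy(dtype=np.float32)

distances = better_measure_distance(lat_1, long_1, lat_2, long_2)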

On my machine, using double precision, the above code takes about 800 MB while the initial implementation takes 8 GB, thus 10 times less memory! It is also still 23 times faster! Using single precision, the above code takes about 500 MB, so 16 times less memory, and it is 48 times faster!
