
Using joblib makes python consume increasing amounts of RAM as the script runs

I have a large number of files I want to load, do some processing on, and then store the processed data. For this I have the following code:

from os import listdir
from os.path import dirname, abspath, isfile, join
import pandas as pd
import sys
import time
# Parallel processing
from joblib import Parallel, delayed
import multiprocessing

# Number of cores
TOTAL_NUM_CORES = multiprocessing.cpu_count()
# Path of this script's file
FILES_PATH = dirname(abspath(__file__))

def read_and_convert(f,num_files):
    # Read the file
    dataframe = pd.read_csv(FILES_PATH + '\\Tick\\' + f, low_memory=False, header=None, names=['Symbol', 'Date_Time', 'Bid', 'Ask'], index_col=1, parse_dates=True)
    # Resample the data to have minute-to-minute data, Open-High-Low-Close format.
    data_bid = dataframe['Bid'].resample('60S').ohlc()
    data_ask = dataframe['Ask'].resample('60S').ohlc()
    # Concatenate the OHLC data
    data_ask_bid = pd.concat([data_bid, data_ask], axis=1, keys=['Bid', 'Ask'])
    # Keep only non-weekend data (from Monday 00:00 until Friday 22:00)
    data_ask_bid = data_ask_bid[(((data_ask_bid.index.weekday >= 0) & (data_ask_bid.index.weekday <= 3)) | ((data_ask_bid.index.weekday == 4) & (data_ask_bid.index.hour < 22)))]
    # Save the processed and concatenated data of each month in a different folder
    data_ask_bid.to_csv(FILES_PATH + '\\OHLC\\' + f)
    print(f)

def main():
    start_time = time.time()
    # Get the paths for all the tick data files
    files_names = [f for f in listdir(FILES_PATH + '\\Tick\\') if isfile(join(FILES_PATH + '\\Tick\\', f))]

    num_cores = int(TOTAL_NUM_CORES/2)
    print('Converting Tick data to OHLC...')
    print('Using ' + str(num_cores) + ' cores.')
    # Open and convert files in parallel
    Parallel(n_jobs=num_cores)(delayed(read_and_convert)(f,len(files_names)) for f in files_names)
    # for f in files_names: read_and_convert(f,len(files_names)) # non-parallel
    print("\nTook %s seconds." % (time.time() - start_time))

if __name__ == "__main__":
    main()

The first couple of files are processed really fast this way, but the speed starts to degrade as the script works through more and more files. As more files are processed, RAM usage grows steadily, as seen below. Isn't joblib flushing the unneeded data as it cycles through the files?

[Screenshot: RAM usage climbing steadily as more files are processed]

Adding gc.collect() as the last line of the function you are running in parallel prevents the RAM from becoming saturated. gc.collect() invokes Python's garbage collector.
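For reference, here is a minimal sketch of where that call would go, based on the read_and_convert function from the question (the processing steps are abbreviated):

import gc  # standard-library interface to Python's garbage collector

def read_and_convert(f, num_files):
    # ... read_csv, resample, concat, weekday filter and to_csv as in the question ...
    print(f)
    # Force a collection pass before the worker picks up the next file,
    # so large intermediate DataFrames are released promptly instead of
    # accumulating in long-lived worker processes.
    gc.collect()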
