
Using joblib makes python consume increasing amounts of RAM as the script runs

I have a large number of files I want to load, do some processing on, and then store the processed data. For this I have the following code:

from os import listdir
from os.path import dirname, abspath, isfile, join
import pandas as pd
import sys
import time
# Parallel processing
from joblib import Parallel, delayed
import multiprocessing

# Number of cores
TOTAL_NUM_CORES = multiprocessing.cpu_count()
# Path of this script's file
FILES_PATH = dirname(abspath(__file__))

def read_and_convert(f,num_files):
    # Read the file
    dataframe = pd.read_csv(FILES_PATH + '\\Tick\\' + f, low_memory=False, header=None, names=['Symbol', 'Date_Time', 'Bid', 'Ask'], index_col=1, parse_dates=True)
    # Resample the data to have minute-to-minute data, Open-High-Low-Close format.
    data_bid = dataframe['Bid'].resample('60S').ohlc()
    data_ask = dataframe['Ask'].resample('60S').ohlc()
    # Concatenate the OHLC data
    data_ask_bid = pd.concat([data_bid, data_ask], axis=1, keys=['Bid', 'Ask'])
    # Keep only non-weekend data (from Monday 00:00 until Friday 22:00)
    data_ask_bid = data_ask_bid[(((data_ask_bid.index.weekday >= 0) & (data_ask_bid.index.weekday <= 3)) | ((data_ask_bid.index.weekday == 4) & (data_ask_bid.index.hour < 22)))]
    # Save the processed and concatenated data of each month in a different folder
    data_ask_bid.to_csv(FILES_PATH + '\\OHLC\\' + f)
    print(f)

def main():
    start_time = time.time()
    # Get the paths for all the tick data files
    files_names = [f for f in listdir(FILES_PATH + '\\Tick\\') if isfile(join(FILES_PATH + '\\Tick\\', f))]

    num_cores = int(TOTAL_NUM_CORES/2)
    print('Converting Tick data to OHLC...')
    print('Using ' + str(num_cores) + ' cores.')
    # Open and convert files in parallel
    Parallel(n_jobs=num_cores)(delayed(read_and_convert)(f,len(files_names)) for f in files_names)
    # for f in files_names: read_and_convert(f,len(files_names)) # non-parallel
    print("\nTook %s seconds." % (time.time() - start_time))

if __name__ == "__main__":
    main()

The first couple of files are processed really fast this way, but the speed starts to degrade as the script works through more and more files. As more files are processed, RAM usage grows steadily, as seen below. Isn't joblib flushing the unneeded data as it cycles through the files?

[Screenshot: RAM usage climbing steadily as more files are processed]

Adding gc.collect() as the last line of the function you are running in parallel prevents the RAM from becoming saturated. gc.collect() invokes Python's garbage collector.
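For reference, here is a minimal sketch of where that call would go, based on the read_and_convert function from the question (the processing steps are abbreviated):

import gc  # standard-library interface to Python's garbage collector

def read_and_convert(f, num_files):
    # ... read_csv, resample, concat, weekday filter and to_csv as in the question ...
    print(f)
    # Force a collection pass before the worker picks up the next file,
    # so large intermediate DataFrames are released promptly instead of
    # accumulating in long-lived worker processes.
    gc.collect()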
