
Using joblib makes python consume increasing amounts of RAM as the script runs


I have a large number of files I want to load, do some processing on, and then store the processed data. For this I have the following code:

from os import listdir
from os.path import dirname, abspath, isfile, join
import pandas as pd
import sys
import time
# Parallel processing
from joblib import Parallel, delayed
import multiprocessing

# Number of cores
TOTAL_NUM_CORES = multiprocessing.cpu_count()
# Path of this script's file
FILES_PATH = dirname(abspath(__file__))

def read_and_convert(f,num_files):
    # Read the file
    dataframe = pd.read_csv(FILES_PATH + '\\Tick\\' + f, low_memory=False, header=None, names=['Symbol', 'Date_Time', 'Bid', 'Ask'], index_col=1, parse_dates=True)
    # Resample to one-minute bars in Open-High-Low-Close format.
    data_bid = dataframe['Bid'].resample('60S').ohlc()
    data_ask = dataframe['Ask'].resample('60S').ohlc()
    # Concatenate the OHLC data
    data_ask_bid = pd.concat([data_bid, data_ask], axis=1, keys=['Bid', 'Ask'])
    # Keep only non-weekend data (from Monday 00:00 until Friday 22:00)
    data_ask_bid = data_ask_bid[(((data_ask_bid.index.weekday >= 0) & (data_ask_bid.index.weekday <= 3)) | ((data_ask_bid.index.weekday == 4) & (data_ask_bid.index.hour < 22)))]
    # Save the processed and concatenated data to the OHLC folder
    data_ask_bid.to_csv(FILES_PATH + '\\OHLC\\' + f)
    print(f)

def main():
    start_time = time.time()
    # Get the paths for all the tick data files
    files_names = [f for f in listdir(FILES_PATH + '\\Tick\\') if isfile(join(FILES_PATH + '\\Tick\\', f))]

    num_cores = int(TOTAL_NUM_CORES/2)
    print('Converting Tick data to OHLC...')
    print('Using ' + str(num_cores) + ' cores.')
    # Open and convert files in parallel
    Parallel(n_jobs=num_cores)(delayed(read_and_convert)(f,len(files_names)) for f in files_names)
    # for f in files_names: read_and_convert(f,len(files_names)) # non-parallel
    print("\nTook %s seconds." % (time.time() - start_time))

if __name__ == "__main__":
    main()

The first couple of files are processed really fast this way, but the speed gets progressively more sluggish as the script works through the remaining files. As more files are processed, RAM usage climbs steadily, as seen below. Isn't joblib flushing the unneeded data as it cycles through the files?

[Image: memory usage climbing steadily as more files are processed]

Adding gc.collect() as the last line of the function you are running in parallel prevents the RAM from getting saturated. gc.collect() invokes Python's garbage collector.
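
As a minimal sketch of how that suggestion could be applied, here is the worker function from the question with the collection step appended at the end. It mirrors the question's own code, so the Windows-style paths and column names are the question's assumptions, not anything required by joblib; the explicit del lines are not part of the original answer, they only make the hand-over to the collector explicit:

import gc
from os.path import dirname, abspath
import pandas as pd

# Path of this script's file (same convention as in the question)
FILES_PATH = dirname(abspath(__file__))

def read_and_convert(f, num_files):
    # Read the tick file
    dataframe = pd.read_csv(FILES_PATH + '\\Tick\\' + f, low_memory=False, header=None,
                            names=['Symbol', 'Date_Time', 'Bid', 'Ask'],
                            index_col=1, parse_dates=True)
    # Resample bid and ask to one-minute OHLC bars
    data_bid = dataframe['Bid'].resample('60S').ohlc()
    data_ask = dataframe['Ask'].resample('60S').ohlc()
    data_ask_bid = pd.concat([data_bid, data_ask], axis=1, keys=['Bid', 'Ask'])
    # Keep only non-weekend data (Monday 00:00 until Friday 22:00)
    data_ask_bid = data_ask_bid[((data_ask_bid.index.weekday >= 0) & (data_ask_bid.index.weekday <= 3)) |
                                ((data_ask_bid.index.weekday == 4) & (data_ask_bid.index.hour < 22))]
    data_ask_bid.to_csv(FILES_PATH + '\\OHLC\\' + f)
    print(f)
    # Drop the large intermediate DataFrames and force a collection
    # before this worker picks up the next file from the joblib queue.
    del dataframe, data_bid, data_ask, data_ask_bid
    gc.collect()

The rest of the script stays the same; only the end of the function running inside Parallel(n_jobs=num_cores)(delayed(read_and_convert)(...)) changes, so each worker frees its DataFrames before starting the next file.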
