
Using joblib makes python consume increasing amounts of RAM as the script runs


I have a large number of files I want to load, do some processing on, and then store the processed data. For this I have the following code:

from os import listdir
from os.path import dirname, abspath, isfile, join
import pandas as pd
import sys
import time
# Parallel processing
from joblib import Parallel, delayed
import multiprocessing

# Number of cores
TOTAL_NUM_CORES = multiprocessing.cpu_count()
# Path of this script's file
FILES_PATH = dirname(abspath(__file__))

def read_and_convert(f,num_files):
    # Read the file
    dataframe = pd.read_csv(FILES_PATH + '\\Tick\\' + f, low_memory=False, header=None, names=['Symbol', 'Date_Time', 'Bid', 'Ask'], index_col=1, parse_dates=True)
    # Resample to one-minute bars in Open-High-Low-Close format.
    data_bid = dataframe['Bid'].resample('60S').ohlc()
    data_ask = dataframe['Ask'].resample('60S').ohlc()
    # Concatenate the OHLC data
    data_ask_bid = pd.concat([data_bid, data_ask], axis=1, keys=['Bid', 'Ask'])
    # Keep only non-weekend data (from Monday 00:00 until Friday 22:00)
    data_ask_bid = data_ask_bid[(((data_ask_bid.index.weekday >= 0) & (data_ask_bid.index.weekday <= 3)) | ((data_ask_bid.index.weekday == 4) & (data_ask_bid.index.hour < 22)))]
    # Save the processed and concatenated data to the OHLC folder
    data_ask_bid.to_csv(FILES_PATH + '\\OHLC\\' + f)
    print(f)

def main():
    start_time = time.time()
    # Get the paths for all the tick data files
    files_names = [f for f in listdir(FILES_PATH + '\\Tick\\') if isfile(join(FILES_PATH + '\\Tick\\', f))]

    num_cores = int(TOTAL_NUM_CORES/2)
    print('Converting Tick data to OHLC...')
    print('Using ' + str(num_cores) + ' cores.')
    # Open and convert files in parallel
    Parallel(n_jobs=num_cores)(delayed(read_and_convert)(f,len(files_names)) for f in files_names)
    # for f in files_names: read_and_convert(f,len(files_names)) # non-parallel
    print("\nTook %s seconds." % (time.time() - start_time))

if __name__ == "__main__":
    main()

The first couple of files are processed really fast this way, but the speed gets progressively more sluggish as the script works through the remaining files. As more files are processed, RAM usage climbs steadily, as seen below. Isn't joblib flushing the unneeded data as it cycles through the files?

[Image: memory usage climbing steadily as more files are processed]

Adding gc.collect() as the last line of the function you are running in parallel prevents the RAM from getting saturated. gc.collect() invokes Python's garbage collector.
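
As a minimal sketch of how that suggestion could be applied, here is the worker function from the question with the collection step appended at the end. It mirrors the question's own code, so the Windows-style paths and column names are the question's assumptions, not anything required by joblib; the explicit del lines are not part of the original answer, they only make the hand-over to the collector explicit:

import gc
from os.path import dirname, abspath
import pandas as pd

# Path of this script's file (same convention as in the question)
FILES_PATH = dirname(abspath(__file__))

def read_and_convert(f, num_files):
    # Read the tick file
    dataframe = pd.read_csv(FILES_PATH + '\\Tick\\' + f, low_memory=False, header=None,
                            names=['Symbol', 'Date_Time', 'Bid', 'Ask'],
                            index_col=1, parse_dates=True)
    # Resample bid and ask to one-minute OHLC bars
    data_bid = dataframe['Bid'].resample('60S').ohlc()
    data_ask = dataframe['Ask'].resample('60S').ohlc()
    data_ask_bid = pd.concat([data_bid, data_ask], axis=1, keys=['Bid', 'Ask'])
    # Keep only non-weekend data (Monday 00:00 until Friday 22:00)
    data_ask_bid = data_ask_bid[((data_ask_bid.index.weekday >= 0) & (data_ask_bid.index.weekday <= 3)) |
                                ((data_ask_bid.index.weekday == 4) & (data_ask_bid.index.hour < 22))]
    data_ask_bid.to_csv(FILES_PATH + '\\OHLC\\' + f)
    print(f)
    # Drop the large intermediate DataFrames and force a collection
    # before this worker picks up the next file from the joblib queue.
    del dataframe, data_bid, data_ask, data_ask_bid
    gc.collect()

The rest of the script stays the same; only the end of the function running inside Parallel(n_jobs=num_cores)(delayed(read_and_convert)(...)) changes, so each worker frees its DataFrames before starting the next file.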
