Python multiprocessing code runs fine, but does not terminate

I have this code (I apologize that it is almost an exact copy-paste from my working code. I don't know where the problem might be, hence I am putting the whole of it here):

import sys
from datetime import datetime as dt
from multiprocessing import Pool, Queue, cpu_count
from threading import Thread

import pandas as pd

# hdf_in, hdf_out, path and model_wrapper are defined elsewhere in my code.

def init(Q):
    """Serves to initialize the queue across all child processes"""
    global q
    q = Q

def queue_manager(q):
    """Listens on the queue, and writes pushed data to file"""
    while True:
        data = q.get()
        if data is None:
            break
        key, preds = data
        with pd.HDFStore(hdf_out, mode='a', complevel=5, complib='blosc') as out_store:
            out_store.append(key, preds)

def writer(message):
    """Pushes messages to queue"""
    q.put(message)

def reader(key):
    """Reads data from store, selects required days, processes it"""
    try:
        # Read the data
        with pd.HDFStore(hdf_in, mode='r') as in_store:
            df = in_store[key]
    except KeyError as ke:
        # Almost guaranteed to not happen
        return (key, pd.DataFrame())
    else:
        # Executes only if exception is not raised
        fit_df = df[(df.index >= '2016-09-11') & \
                    (df.index < '2016-09-25') & \
                    (df.index.dayofweek < 5)].copy()
        pre_df = df[(df.index >= '2016-09-18') & \
                    (df.index < '2016-10-2') & \
                    (df.index.dayofweek < 5)].copy()
        del df
        # model_wrapper below is a custom function in another module.
        # It works fine.
        models, preds = model_wrapper(fit_df=fit_df, pre_df=pre_df)
        if preds is not None:
            writer((key, preds))
            del preds
    return (key, models)

def main():
    sensors = pd.read_csv('sens_metadata.csv', index_col=[0])
    nprocs = int(cpu_count() - 0)
    maxproc = 10
    q = Queue()
    t = Thread(target=queue_manager, args=(q,))

    print("Starting process at\t{}".format(dt.now().time()))
    sys.stdout.flush()
    t.start()
    with Pool(processes=nprocs, maxtasksperchild=maxproc, initializer=init,
              initargs=(q,)) as p:
        models = p.map(reader, sensors.index.tolist(), 1)
    print("Processing done at\t{}".format(dt.now().time()))
    print("\nJoining Thread, and finishing writing predictions")
    sys.stdout.flush()
    q.put(None)
    t.join()
    print("Thread joined successfully at\t{}".format(dt.now().time()))
    print("\nConcatenating models and serializing to pickle")
    sys.stdout.flush()
    pd.concat(dict(models)).to_pickle(path + 'models.pickle')
    print("Pickled successfully at\t{}".format(dt.now().time()))

if __name__ == '__main__':
    main()

This code behaves like a badly biased coin toss: most of the time it does not work, and only sometimes does it work. When it does run to completion, I know it takes about 2.5 hours to work through the whole dataset (all the keys). In 9 out of 10 runs it processes all the data, and I can see that data in the hdf_out file, but then the multiprocessing pool does not join: all the child processes are still alive, but are not doing any work. I just don't understand why the program hangs like that.

When that happens, I never see the "Processing done at ..." and "Joining Thread, ..." messages. Also, if I give it a smaller dataset, it finishes. If I exclude the calculation of preds, it finishes. I cannot exclude the calculation of models without heavy modification, which would not be conducive to the rest of the project.

I don't know why this might be happening. I am using Linux (Kubuntu 16.04).
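One thing I could do to investigate further (just a diagnostic sketch, not a fix) is to register a signal handler in every worker through the standard faulthandler module, so that a stuck child can be asked for a Python traceback from another terminal:

import faulthandler
import signal

def init(Q):
    """Serves to initialize the queue across all child processes"""
    global q
    q = Q
    # Diagnostic only: dump this worker's Python stack to stderr when it
    # receives SIGUSR1, e.g. `kill -USR1 <worker pid>` from a shell.
    faulthandler.register(signal.SIGUSR1, all_threads=True)

When the pool appears hung, the tracebacks would show whether the workers are blocked in q.put, inside the HDF read, or somewhere in model_wrapper.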

Apparently, dropping the maxtasksperchild kwarg solves the issue. Why that is so is something I don't clearly understand; I suppose it has to do with the distinction between the fork start method (the default on Linux) and the spawn start method (the only option on Windows).
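Concretely, the only change is inside main(); the Pool construction that runs to completion looks like this (a sketch, everything else in the function unchanged):

    # Same initializer pattern as before, but without maxtasksperchild.
    # Under the fork start method (the Linux default) this finishes and joins.
    with Pool(processes=nprocs, initializer=init, initargs=(q,)) as p:
        models = p.map(reader, sensors.index.tolist(), 1)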

With fork, maxtasksperchild is apparently not required, and performance is better without it. I noticed that memory use improved after dropping maxtasksperchild: memory is not hogged by the child processes but is shared from the parent process. However, the one time I had to use Windows, maxtasksperchild was a crucial way of keeping child processes from bloating up, especially when running memory-intensive tasks over a long task list.
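If the same script ever has to run under spawn (e.g. on Windows) as well as under fork, one option is to set maxtasksperchild only for spawn. This is only a sketch, not something I have tested beyond what is described above, and make_pool is just an illustrative helper name:

import multiprocessing as mp

def make_pool(nprocs, q):
    """Build the pool; recycle workers only under the spawn start method."""
    kwargs = dict(processes=nprocs, initializer=init, initargs=(q,))
    if mp.get_start_method() == 'spawn':
        # Under spawn (Windows), recycling workers keeps each child's
        # memory footprint in check on long, memory-intensive task lists.
        kwargs['maxtasksperchild'] = 10
    return mp.Pool(**kwargs)

Under fork the workers then live for the whole run and share memory with the parent process, which matches the behaviour I observed.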

If someone knows better what is happening here, please feel free to edit this answer.
