
How to know how many threads/workers from a pool in multiprocessing (Python module) have completed?

I am using the Impala shell to compute some stats over a text file containing the table names.

I am using the Python multiprocessing module to pool the processes.
The thing is, the task is very time-consuming, so I need to keep track of how many tables have been completed to see the job's progress.
So let me give you some idea of the functions that I am using.

job_executor is the function that takes a list of tables and performs the task.

main() is the function that takes the file location and the number of executors (pool_workers), converts the file containing the tables into a list of tables, and does the multiprocessing.

I want to see the progress, i.e. how much of the file has been processed by job_executor, but I can't find a solution. Using a counter also doesn't work. Help me.

import argparse
import os
from multiprocessing import Pool


def job_executor(text):

    impala_cmd = "impala-shell -i %s -q  'compute stats %s.%s'" % (impala_node, db_name, text)
    impala_cmd_res = os.system(impala_cmd)  #runs impala Command    

    #checks for execution type(success or fail)
    if impala_cmd_res == 0:
        print ("invalidated the metadata.")
    else:
        print("error while performing the operation.")


def main(args):
    text_file_path = args.text_file_path
    NUM_OF_EXECUTORS = int(args.pool_executors)

    with open(text_file_path, 'r') as text_file_reader:
        text_file_rows = text_file_reader.read().splitlines()  # returns a list of all the tables in the file
        process_pool = Pool(NUM_OF_EXECUTORS)
        try:
            process_pool.map(job_executor, text_file_rows)
            process_pool.close()
            process_pool.join()
        except Exception:
            process_pool.terminate()
            process_pool.join()


def parse_args():
    """
    function to parse the arguments passed from the test_hr.sh file
    """
    parser = argparse.ArgumentParser(description='Main Process file that will start the process and session too.')
    parser.add_argument("text_file_path",
                        help='provide text file path/location to be read.')  # text file path
    parser.add_argument("pool_executors",
                        help='please provide pool executors as an initial argument')  # number of pool executors

    return parser.parse_args()  # returns a Namespace holding all arguments


if __name__ == "__main__":
    mail_message_start()

    main(parse_args())

    mail_message_end()

If you insist on needlessly doing it via multiprocessing.pool.Pool(), the easiest way to keep track of what's going on is to use a non-blocking mapping (i.e. multiprocessing.pool.Pool.map_async()):

import time  # used by the polling loop below


def main(args):
    text_file_path = args.text_file_path
    NUM_OF_EXECUTORS = int(args.pool_executors)

    with open(text_file_path, 'r') as text_file_reader:
        text_file_rows = text_file_reader.read().splitlines()
        total_processes = len(text_file_rows)  # keep the number of lines for reference
        process_pool = Pool(NUM_OF_EXECUTORS)
        try:
            print('Processing {} lines.'.format(total_processes))
            processing = process_pool.map_async(job_executor, text_file_rows)
            processes_left = total_processes  # number of processing lines left
            while not processing.ready():  # start a loop to wait for all to finish
                if processes_left != processing._number_left:
                    processes_left = processing._number_left
                    print('Processed {} out of {} lines...'.format(
                        total_processes - processes_left, total_processes))
                time.sleep(0.1)  # let it breathe a little
            print('All done!')
            process_pool.close()
            process_pool.join()
        except Exception:
            process_pool.terminate()
            process_pool.join()

This will check every 100 ms whether some of the processes have finished, and if anything changed since the last check it will print the number of lines processed so far. If you need more insight into what's going on in your subprocesses, you can use shared structures like multiprocessing.Queue() or multiprocessing.Manager() to report directly from within your processes.
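One caveat: _number_left is a private attribute and it counts remaining chunks rather than individual lines, so unless you pass chunksize=1 to map_async() the printed progress can advance in steps. As an illustration of the Manager() approach, here is a minimal sketch that reuses job_executor() from the question (the wrapper and function names here are my own, not part of any API): each worker increments a shared counter after it finishes a table, so the count is exact per line:

from multiprocessing import Manager, Pool

def job_executor_with_progress(args):
    table, counter, lock, total = args  # unpack the shared state passed from main
    job_executor(table)  # the same worker function as in the question
    with lock:  # serialize the read-modify-write on the shared counter
        counter.value += 1
        print('Processed {} out of {} lines...'.format(counter.value, total))

def main_with_progress(text_file_rows, num_of_executors):
    manager = Manager()
    counter = manager.Value('i', 0)  # shared integer counter, starts at 0
    lock = manager.Lock()            # manager proxies are picklable, so they can be passed to workers
    total = len(text_file_rows)
    with Pool(num_of_executors) as process_pool:
        process_pool.map(job_executor_with_progress,
                         [(row, counter, lock, total) for row in text_file_rows])
    print('All done!')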
