Multiprocessing: how do I improve this?

Question

I have a script that takes xlsx files, reformats them, and creates a txt document recording the reformatting. The script works well and does what I want it to do. However, it is not as fast as I would like because multiprocessing is not being fully utilized: at times there may be only a handful of files being reformatted in each "files_xlsx". If I remove the processes.join(), it ends up crashing. Ideally, I would like it to work on as many xlsx sheets at a time as possible, from multiple "files_xlsx" lists/directories. But I have had no luck in writing my code to do so. Is there an easy way to adjust my current code so that it works on more xlsx files at a time?

Answer 1

The most straightforward way to take advantage of Python's multiprocessing library is to use Pool.

Please review the modifications to your code as seen below. Note that I did not modify def rename_sheets in any way.

from multiprocessing import Pool
# From Python 3.4 onwards, you can use pathlib
from pathlib import Path

import openpyxl

# rename_sheets is your own function, unchanged, defined elsewhere in the script

def convert_excel_txt(fil):
    # This function processes one and only one file;
    # the multiprocessing is taken care of by Pool.
    # No directories argument is needed: each path carries its own directory.
    # The variable name *file* is not a good idea, hence *fil*.
    open_xl = openpyxl.load_workbook(fil)
    titles = open_xl.sheetnames
    # print(len(titles))
    path = Path(fil)
    # The Reference_Sheets folder is expected to exist next to the workbook
    outpath = path.parent / "Reference_Sheets" / (path.stem + ".txt")
    with open(outpath, 'a', encoding='utf-8') as outfile:
        for count, title in enumerate(titles, start=1):
            # print("{}.| {}".format(count, title))
            sheet_title_value = rename_sheets(title, count, open_xl, fil)
            outfile.write('\n' + str(count) + ". " + sheet_title_value)


# The guard is required on Windows, where child processes re-import this module
if __name__ == '__main__':
    report_type = "Annual"
    files_xlsx = []
    with open(r"C:\Python38\Projects\s_&p_500_links_test.txt", "r") as directories:
        for directory in directories:
            directory = directory.strip()
            if not directory:
                continue
            print(directory)
            folder = Path(directory) / report_type
            # Collect full paths from every directory, not only the last one
            files_xlsx += [str(f) for f in folder.iterdir() if f.suffix == '.xlsx']
    print(files_xlsx)

    with Pool(24) as pool:
        pool.map(convert_excel_txt, files_xlsx)
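One way to keep every worker busy is to gather the xlsx paths from all directories first and then hand the complete list to a single pool, so that a directory holding only a handful of files does not leave processes idle. The following is a minimal sketch of that idea, not a drop-in replacement: collect_xlsx and process_all are hypothetical helper names, and the "Annual" subfolder is assumed from the question.

```python
from multiprocessing import Pool
from pathlib import Path


def collect_xlsx(directories, report_type="Annual"):
    """Return every .xlsx path found under <directory>/<report_type>."""
    paths = []
    for directory in directories:
        folder = Path(directory) / report_type
        # Skip directories that have no report_type subfolder
        if folder.is_dir():
            # sorted() keeps the order stable between runs
            paths.extend(sorted(folder.glob("*.xlsx")))
    return paths


def process_all(worker, paths, processes=None):
    """Run worker(path) for every path in one shared pool (hypothetical helper)."""
    # processes=None lets Pool default to os.cpu_count()
    with Pool(processes) as pool:
        # imap_unordered hands back results as soon as any worker finishes
        return list(pool.imap_unordered(worker, [str(p) for p in paths]))
```

In a script, both helpers would be called under the if __name__ == '__main__': guard, for example process_all(convert_excel_txt, collect_xlsx(directories)), where directories is the list of paths read from the links file.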

To time the execution of various versions of the code, proceed as follows:

import time
import datetime

overall_start_time = time.time()
print('Started at ', time.strftime('%X %x %Z'))

# timed code goes here

print("Time elapsed overall (hours:min:sec): %s" % datetime.timedelta(seconds=time.time() - overall_start_time))
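When comparing several versions, it can be convenient to wrap the same timing logic in a reusable helper. The sketch below is one possible way to do that; the name timed and the optional results dictionary are assumptions made for illustration, and it uses time.perf_counter, which is intended for measuring intervals.

```python
import time
from contextlib import contextmanager
from datetime import timedelta


@contextmanager
def timed(label, results=None):
    """Print (and optionally record) how long the enclosed block took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            # Store the raw seconds so different versions can be compared later
            results[label] = elapsed
        print(f"{label}: {timedelta(seconds=elapsed)}")
```

Each version of the code can then be placed inside its own with timed("some label"): block.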

Reference : https://docs.python.org/2/library/multiprocessing.html

solution1
1 2020-08-11 13:36:03
