I have a script in which I am taking xlsx files and reformatting them and creating a txt document recording the reformatting. The script works well and does what I want it to do. However, It is not as fast as I would like as the multiprocessing is not being fully utilized. At times there may only be a handful of files being reformatted in each "files_xlsx". If I remove the processes.join() it ends up crashing. Ideally, I would like it to work on as many xlsx sheets at a time from multiple "files_xlsx"/directories etc. But I have had no luck in writing my code to do so. Is there an easy alternative to adjust my current code to allow it to work on more xlsx at a time?
The most straighforward way to take advantage of Python's multiprocessing
library is to use Pool
.
Please review the modifications to your code as seen below. Note that I did not modify def rename_sheets
in any way.
# From Python 3.4 onwards, you can use pathlib
from pathlib import Path
def convert_excel_txt(fil):
# directories is a globally defined variable. Not needed as an argument
# Variable name *file* is not a good idea.
# This method is to process one and only one file
# The multiprocessing is taken care of by Pool
open_xl = openpyxl.load_workbook(fil)
titles = xls.sheet_names()
# print(len(titles))
count = 1
for title in titles:
# print("{}.| {}".format(count, title))
sheet_title_value = rename_sheets(title, count, open_xl, fil)
# We'll navigate to the directory we're working on
directory = Path(fil).parent
with open(directory+"\\Reference_Sheets\\"+fil[:-5]+".txt", 'a', encoding='utf-8') as outfile:
outfile.write('\n'+str(count)+". "+sheet_title_value)
count +=1
directories = open(r"C:\Python38\Projects\s_&p_500_links_test.txt", "r")
files = []
for directory in directories:
directory = directory[:-1]
print(directory)
report_type = "Annual"
path = os.chdir(directory)
files = os.listdir(directory+"\\"+report_type)
print(files)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']
pool = Pool(24)
pool.map(convert_excel_txt, files_xlsx )
To time the execution of various versions of the code, proceed as follows:
import time
import datetime
overall_start_time = time.time()
print('Started at ', time.strftime('%X %x %Z'))
# timed code goes here
print ("Time elapsed overall (hours:min:sec): %s" % str(datetime.timedelta(seconds=(time.time()- overall_start_time))))
Reference : https://docs.python.org/2/library/multiprocessing.html
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.