Not able to parallelize pandas apply using swifter

Question

I am trying to correct OCR-parsed words in a document by passing each word through a custom process that is time-consuming. The process is custom business functionality that looks at various semantics of the word.

I am trying to speed this up using swifter. I have a 16-core processor, but not all cores are being used: only one core runs at 100% while the remaining 15 sit idle. What am I missing?

I tried the options below without success. Can someone point out what I am missing? df is a DataFrame in which each row contains a word. correct_ocr_string is a business function that takes a string as input, runs it through a custom ML model, and returns a string.

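import numpy as np
import swifter  # importing swifter registers the .swifter accessor on pandas objects

# Attempt 1: wrap the function in a lambda
df['Corrected'] = df.OCR.swifter.progress_bar(False).apply(lambda x: correct_ocr_string(x))

# Attempt 2: pass the function directly
df['Corrected'] = df.OCR.swifter.progress_bar(False).apply(correct_ocr_string)

# Attempt 3: wrap the function with np.vectorize first
v_fnc = np.vectorize(correct_ocr_string)
df['Corrected'] = df.OCR.swifter.progress_bar(False).apply(v_fnc)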

I also tried pandarallel's parallel_apply, with no success:

import multiprocessing
from pandarallel import pandarallel
pandarallel.initialize(nb_workers=multiprocessing.cpu_count())
df['Corrected'] = df.OCR.parallel_apply(correct_ocr_string)

Answer 1

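By default swifter does not hand string columns to Dask, so the apply stays on a single core. You have to enable that explicitly with allow_dask_on_strings(enable=True):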

df.OCR.swifter.allow_dask_on_strings(enable=True).apply(correct_ocr_string)

Are you by any chance running this in a Jupyter Notebook? Multiprocessing can cause problems there, with both swifter and pandarallel.
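
If the notebook turns out to be the culprit, one way to rule the libraries out is to run the same work from a plain script using only the standard library. Below is a minimal sketch of that idea, not a drop-in replacement: correct_ocr_string here is a hypothetical stand-in for the real business function, and correct_all, workers and chunksize are names made up for this example.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def correct_ocr_string(word):
    # Hypothetical stand-in for the real, slow business function.
    return word.strip().lower()


def correct_all(words, workers=None, chunksize=256):
    # Default to one worker process per core when no count is given.
    workers = workers or os.cpu_count()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize sends words to the workers in batches, which keeps
        # the per-item inter-process overhead small.
        return list(pool.map(correct_ocr_string, words, chunksize=chunksize))


if __name__ == "__main__":
    # The guard matters: worker processes may re-import this module.
    sample = ["  Hello", "WORLD ", "OCR"]
    print(correct_all(sample, workers=2, chunksize=1))
```

With a DataFrame, the call would look like df['Corrected'] = correct_all(df.OCR.tolist()). pool.map returns results in input order, so they line up with the rows. The function passed to the pool has to be defined at module level so the worker processes can find it, which is one reason this tends to behave better in a script than in a notebook cell.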
