
Bad performance with Python multiprocessing using OpenCV and Tesseract

I am trying to parallelize code whose purpose is to extract text from videos, doing OCR with OpenCV and Tesseract. Specifically, there are 9 videos, and those 9 videos are divided into 3 categories, so all videos in a category can be analyzed the same way. That is why I wrote 3 main functions, one per category, which means each function can be reused for the 3 videos of its category. Each function takes between 90 and 200 seconds when run on its own, so if I analyze 3 videos in a single execution, the total time is much longer, because the functions execute sequentially.

That is why I decided to use the multiprocessing module, and I did get the functions to run in parallel, but I did not get the performance I expected. When I run 2 processes in parallel (1 video per process), execution time increases by roughly 10-15%, which is acceptable. But when I run 3 processes in parallel (again 1 video per process), execution time increases drastically; in fact, I noticed the processes had stalled from the silence of my CPU cooler. I checked this with htop on my Linux system (Ubuntu 20.04.2 LTS), and it was indeed so: when running 3 processes in parallel, at a certain point all 6 CPU cores hit their limit (100%), causing the processes to stall.
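For reference, the processes are launched roughly like this (a simplified sketch; the function and file names below are placeholders, the real code is in the repository linked at the end):

    import multiprocessing as mp

    def analyze_category1(path):
        # Placeholder: the real OCR/extraction logic lives in the repository.
        print(f"analyzing {path}")

    def analyze_category2(path):
        print(f"analyzing {path}")

    def analyze_category3(path):
        print(f"analyzing {path}")

    if __name__ == "__main__":
        jobs = [
            (analyze_category1, "video_cat1.mp4"),
            (analyze_category2, "video_cat2.mp4"),
            (analyze_category3, "video_cat3.mp4"),
        ]
        procs = [mp.Process(target=fn, args=(path,)) for fn, path in jobs]
        for p in procs:
            p.start()  # one worker process per video
        for p in procs:
            p.join()   # wait for all extractions to finish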

[Image: CPU usage shown in htop]

I found a way to partially work around it: I staggered the start times of the executions, so the processes were not all using 100% of the cores at the same time, and got an acceptable execution time. But I still need to analyze more videos in parallel; 3 is still too few. Is there any way to increase performance? I really did not expect this from Python, considering it is running on an i5-8600K with 16 GB of RAM at 3200 MHz.
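The staggering amounts to something like this (a minimal sketch, with a placeholder analyze function and an arbitrary 30-second offset):

    import time
    import multiprocessing as mp

    def analyze(path):
        # Placeholder for one of the per-category functions from the repository.
        print(f"analyzing {path}")

    if __name__ == "__main__":
        videos = ["video_cat1.mp4", "video_cat2.mp4", "video_cat3.mp4"]
        procs = []
        for path in videos:
            p = mp.Process(target=analyze, args=(path,))
            p.start()
            procs.append(p)
            time.sleep(30)  # offset each start so the CPU-heavy phases overlap less
        for p in procs:
            p.join()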

Important to mention:

  • The excessive CPU usage comes from various for loops inside the functions; these loops call OpenCV methods and are required for the data extraction (see the threading sketch after this list).
  • The processes do not transfer any data between them.
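Worth noting: many OpenCV functions are internally multithreaded, so 3 worker processes can together demand more than 3 cores, which might explain all 6 cores hitting 100%. A minimal sketch of limiting OpenCV to one thread per worker via cv2.setNumThreads (again with a placeholder analyze function standing in for the real ones):

    import multiprocessing as mp

    import cv2

    def analyze(path):
        # Placeholder for one of the per-category functions from the repository.
        print(f"analyzing {path}")

    def worker(path):
        # Restrict OpenCV's internal thread pool in this process, so N worker
        # processes use about N cores instead of N * (OpenCV's thread count).
        cv2.setNumThreads(1)
        analyze(path)

    if __name__ == "__main__":
        videos = ["video_cat1.mp4", "video_cat2.mp4", "video_cat3.mp4"]
        procs = [mp.Process(target=worker, args=(path,)) for path in videos]
        for p in procs:
            p.start()
        for p in procs:
            p.join()

Whether this actually helps depends on which OpenCV calls dominate the loops.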

If you want to check the code, you can find it here: GitHub repository
