
How to run memory intensive PHP tasks (image conversion and OCR)?

I'm not sure if this kind of question is allowed on Stack Overflow, but I'm mostly looking for advice.

I have a web app which accepts PDF uploads, converts them to a TIFF, then OCRs them with Tesseract.

These PDFs are 50 - 200+ pages long. My server only completes this for PDFs of fewer than about 6 pages.

The resultant TIFF is 1.2GB. The PDF was only 98KB. We have some PDFs which are already hundreds of MBs, so who knows what they'll end up as once converted. This size seems wrong, but let's table it for now.
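For what it's worth, a 1.2GB TIFF from a 98KB PDF usually means the rasteriser wrote uncompressed output. A sketch of a compressed conversion, assuming Ghostscript is installed (paths are placeholders):

```shell
# tiffg4 writes bilevel Group 4 (fax) compression, typically a small fraction
# of the size of an uncompressed TIFF, and perfectly usable as OCR input.
gs -dBATCH -dNOPAUSE -sDEVICE=tiffg4 -r300 -o /path/to/output.tiff /path/to/input.pdf
```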

Once we start talking about 200 page PDFs, nothing works. I get the error:

exec(): Unable to fork [tesseract '/home/forge/default/storage/app/ocr/1.tiff' /tmp/tesseractbO7aur -psm 3  2>&1]

The TIFF conversion works OK, even with large PDFs. But Tesseract always gives this error when the PDF is more than ~6 pages.

Perhaps I just need a lot more memory. My questions are:

How can I determine what the limit/max being hit is? How do I know if this is a RAM issue, a CPU issue, something else?
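(Answering the diagnosis part: "Unable to fork" means exec() could not spawn the child process at all, which on Linux normally means the kernel refused fork() for lack of free memory/swap or because a per-user process limit was hit. A few checks worth running when it happens; dmesg may need root:)

```shell
free -m                     # how much RAM and swap is actually free
ulimit -u                   # per-user process limit; fork fails beyond it
dmesg 2>/dev/null | grep -iE 'oom|out of memory' || true   # did the OOM killer fire?
```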

How would you run this? Should I keep this on our web server and just significantly increase the spec? Or would you make another machine dedicated to producing the OCRs? They don't need to be instant in response to user events - it's fine if a user uploads and the OCR takes a few hours. I'm used to heavy workloads just taking a long time, not dying entirely; I'd happily trade a very slow OCR for one that doesn't fail outright.

I've only ever worked on simple web apps where the user makes a request and a page is displayed. I'm not used to this sort of stuff. I am using Laravel for the app, so I have access to Redis queues etc. if they should be used. I'm using Nginx on AWS. I did consider AWS Lambda, but I don't think it can achieve what I need.

Thanks, and I hope someone can help.

Sam

I suspect this isn't really to do with PHP.

First you need to make sure you can actually run this process in Tesseract directly on the command line.

Open two SSH sessions: in one, run something like htop to monitor server resources; then in the second, try running your conversion process manually.

If you see resource usage and load average go crazy in htop then you know you need a beefier server, or to find a more efficient way of running the task.
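One "more efficient way" worth trying: split the job per page so Tesseract never sees the whole document at once, which bounds peak memory to a single page. A rough sketch, assuming Ghostscript and Tesseract are on the PATH and the paths are placeholders:

```shell
#!/bin/sh
set -e
pdf=/path/to/input.pdf
work=$(mktemp -d)

# Rasterise one Group 4-compressed TIFF per page instead of one huge file.
gs -dBATCH -dNOPAUSE -sDEVICE=tiffg4 -r300 -o "$work/page-%04d.tiff" "$pdf"

# OCR each page separately; peak memory is one page, not the whole document.
for page in "$work"/page-*.tiff; do
    tesseract "$page" "${page%.tiff}" -psm 3
done

# Stitch the per-page text back together.
cat "$work"/page-*.txt > /path/to/output.txt
```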

Only once you know it will work manually on the command line should you try to get this working via PHP.

Even with PHP, I would advise some kind of job queue for scheduling the conversion task.
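In Laravel that means dispatching the conversion as a queued job and running a dedicated worker process off the web request path. A sketch (the queue name and limits here are assumptions, not your config):

```shell
# --timeout must exceed the longest conversion; --memory restarts the worker
# between jobs once it passes that many MB, before PHP kills it mid-job.
php artisan queue:work redis --queue=ocr --timeout=7200 --memory=1024
```

Run the worker under a supervisor so it is restarted if it dies, and it can live on a separate machine from the web server.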

I solved this by running it on a huge AWS EC2 instance; smaller EC2 instances give this same issue. Running a 500-page PDF through the conversion and OCR worked on a compute-optimised c4.4xlarge (~$600/month).
