I am working with Flask: I upload a PDF file, convert it to images with pdf2image, and run OCR on them with pytesseract.
However, pdf2image is not able to read the uploaded file. I searched online but could not find a solution.
I tried passing the FileStorage object directly, but that raises an error. My code looks like this:
log_file = request.files.get('pdf')
images = convert_from_path(log_file)
text = ""
for img in images:
    im = img
    ocr_dict = pytesseract.image_to_data(im, lang='eng', output_type=Output.DICT)
    text += " ".join(ocr_dict['text'])
cleaned_text = clean_text(txt=text)
which gives this error:
**TypeError: expected str, bytes or os.PathLike object, not FileStorage**
I also tried passing the raw bytes of the upload instead:
log_file = request.files.get('pdf')
images = convert_from_path(log_file.read())
text = ""
for img in images:
    im = img
    ocr_dict = pytesseract.image_to_data(im, lang='eng', output_type=Output.DICT)
    text += " ".join(ocr_dict['text'])
cleaned_text = clean_text(txt=text)
which gives this error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pdf2image/pdf2image.py", line 458, in pdfinfo_from_path
    proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
  File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.8/subprocess.py", line 1639, in _execute_child
    self.pid = _posixsubprocess.fork_exec(
ValueError: embedded null byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/usr/local/lib/python3.8/dist-packages/flask_restful/__init__.py", line 467, in wrapper
    resp = resource(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/flask/views.py", line 84, in view
    return current_app.ensure_sync(self.dispatch_request)(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/flask_restful/__init__.py", line 582, in dispatch_request
    resp = meth(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/flask_httpauth.py", line 172, in decorated
    return self.ensure_sync(f)(*args, **kwargs)
  File "/home/ubuntu/Credit_Scoring/API_Script/temp2.py", line 38, in post
    json_text = coi_ocr.get_coi_ocr_text()
  File "/home/ubuntu/Credit_Scoring/API_Script/ocr_script/certificate_of_incorporation/coi_ocr_script_pdf.py", line 51, in get_coi_ocr_text
    text1 = self.extract_text_from_COI()
  File "/home/ubuntu/Credit_Scoring/API_Script/ocr_script/certificate_of_incorporation/coi_ocr_script_pdf.py", line 16, in extract_text_from_COI
    images = convert_from_path(self.fl)
  File "/usr/local/lib/python3.8/dist-packages/pdf2image/pdf2image.py", line 98, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "/usr/local/lib/python3.8/dist-packages/pdf2image/pdf2image.py", line 489, in pdfinfo_from_path
    "Unable to get page count.\n%s" % err.decode("utf8", "ignore")
UnboundLocalError: local variable 'err' referenced before assignment
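Okay, it turns out I need to call convert_from_bytes instead of convert_from_path. convert_from_path expects a filesystem path, so the FileStorage object raises the TypeError, and the raw PDF bytes returned by log_file.read() get treated as a path, which fails with "embedded null byte". convert_from_bytes accepts those bytes directly.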
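
For reference, here is a minimal sketch of the fixed flow, assuming pdf2image (with poppler) and pytesseract are installed. The function name extract_text_from_upload and its to_images / to_data parameters are hypothetical; they are not part of either library and exist only so the function can be tried without poppler or tesseract present. Unlike my original loop, this version drops the empty strings tesseract returns and puts a space between pages.

```python
def extract_text_from_upload(uploaded, to_images=None, to_data=None):
    """Return the OCR text of an uploaded PDF.

    `uploaded` is any object with a .read() method that returns bytes,
    such as the FileStorage that Flask puts in request.files.
    `to_images` and `to_data` are hypothetical hooks for swapping out the
    real converter and OCR call (for example in tests).
    """
    pdf_bytes = uploaded.read()
    if not pdf_bytes:
        raise ValueError("uploaded PDF is empty")

    if to_images is None:
        # Imported lazily so the module loads even where pdf2image is absent.
        from pdf2image import convert_from_bytes
        to_images = convert_from_bytes

    if to_data is None:
        import pytesseract
        from pytesseract import Output

        def to_data(image):
            return pytesseract.image_to_data(
                image, lang="eng", output_type=Output.DICT
            )

    words = []
    for image in to_images(pdf_bytes):  # one image per PDF page
        ocr_dict = to_data(image)
        # Skip the empty / whitespace-only entries tesseract emits.
        words.extend(w for w in ocr_dict["text"] if w.strip())
    return " ".join(words)
```

In the view this would be called as extract_text_from_upload(request.files.get('pdf')), and the result passed on to clean_text as before. If a real path is needed for some other reason, saving the upload to a temporary file first and then calling convert_from_path on that path should also work.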