How to read a PDF from a FileStorage object in pdf2image?

I am working with Flask. I upload a PDF file, convert it to images with pdf2image, and then run OCR on each page with pytesseract.

However, pdf2image is not able to read the uploaded file, and I could not find a solution by searching online.

I tried passing the FileStorage object directly, but I get an error. My code looks like this:

log_file = request.files.get('pdf')
images = convert_from_path(log_file)
text = ""
for img in images:
  im = img

  ocr_dict = pytesseract.image_to_data(im, lang='eng', output_type=Output.DICT)
  text += " ".join(ocr_dict['text'])
  cleaned_text = clean_text(txt=text)

which gives this error:

**TypeError: expected str, bytes or os.PathLike object, not FileStorage**

I also tried passing the file's contents instead:

log_file = request.files.get('pdf')
images = convert_from_path(log_file.read())
text = ""
for img in images:
  im = img

  ocr_dict = pytesseract.image_to_data(im, lang='eng', output_type=Output.DICT)
  text += " ".join(ocr_dict['text'])
  cleaned_text = clean_text(txt=text)

which gives this error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pdf2image/pdf2image.py", line 458, in pdfinfo_from_path
    proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
  File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.8/subprocess.py", line 1639, in _execute_child
    self.pid = _posixsubprocess.fork_exec(
ValueError: embedded null byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/usr/local/lib/python3.8/dist-packages/flask_restful/__init__.py", line 467, in wrapper
    resp = resource(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/flask/views.py", line 84, in view
    return current_app.ensure_sync(self.dispatch_request)(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/flask_restful/__init__.py", line 582, in dispatch_request
    resp = meth(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/flask_httpauth.py", line 172, in decorated
    return self.ensure_sync(f)(*args, **kwargs)
  File "/home/ubuntu/Credit_Scoring/API_Script/temp2.py", line 38, in post
    json_text = coi_ocr.get_coi_ocr_text()
  File "/home/ubuntu/Credit_Scoring/API_Script/ocr_script/certificate_of_incorporation/coi_ocr_script_pdf.py", line 51, in get_coi_ocr_text
    text1 = self.extract_text_from_COI()
  File "/home/ubuntu/Credit_Scoring/API_Script/ocr_script/certificate_of_incorporation/coi_ocr_script_pdf.py", line 16, in extract_text_from_COI
    images = convert_from_path(self.fl)
  File "/usr/local/lib/python3.8/dist-packages/pdf2image/pdf2image.py", line 98, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "/usr/local/lib/python3.8/dist-packages/pdf2image/pdf2image.py", line 489, in pdfinfo_from_path
    "Unable to get page count.\n%s" % err.decode("utf8", "ignore")
UnboundLocalError: local variable 'err' referenced before assignment

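Okay, it turns out I need to call convert_from_bytes instead of convert_from_path. convert_from_path expects a filesystem path, so the raw bytes returned by log_file.read() were being treated as a file name, which is where the "embedded null byte" error came from. convert_from_bytes accepts the PDF contents directly.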
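A minimal sketch of that fix, assuming the upload is a werkzeug FileStorage (or any seekable file-like object) and that Poppler and Tesseract are installed. The helper names read_upload_bytes and ocr_uploaded_pdf, and the convert/ocr parameters, are hypothetical and exist only so the sketch can be tried without those tools. The clean_text step from the question is left out.

```python
import io


def read_upload_bytes(upload):
    """Return the full contents of a file-like upload as bytes."""
    upload.seek(0)  # rewind in case the stream was already read
    return upload.read()


def ocr_uploaded_pdf(upload, convert=None, ocr=None):
    """OCR every page of an uploaded PDF and return the combined text.

    `convert` and `ocr` are hypothetical hooks. They default to
    pdf2image.convert_from_bytes and pytesseract.image_to_data, and are
    parameters only so the function can run without Poppler/Tesseract.
    """
    if convert is None:
        # Imported lazily so the module loads even without pdf2image.
        from pdf2image import convert_from_bytes as convert
    if ocr is None:
        import pytesseract

        def ocr(image):
            # Same call as in the question, joined into one string per page.
            data = pytesseract.image_to_data(
                image, lang="eng", output_type=pytesseract.Output.DICT
            )
            return " ".join(data["text"])

    # Pass the PDF contents (bytes), not a path and not the FileStorage itself.
    pages = convert(read_upload_bytes(upload))
    return " ".join(ocr(page) for page in pages)
```

Inside the Flask view this would be called as text = ocr_uploaded_pdf(request.files.get('pdf')). If the upload is very large, saving it to a temporary file and calling convert_from_path on that path is an alternative.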
