How to read PDF from file storage object in pdf2image?

I am using flask and uploading a pdf file in order to convert it to images and run OCR on them with pytesseract.

However, pdf2image is not able to read the uploaded file. I searched online but could not find anything.

I tried passing the FileStorage object directly, but it raises an error. My code looks like this:

log_file = request.files.get('pdf')
images = convert_from_path(log_file)
text = ""
for img in images:
  im = img

  ocr_dict = pytesseract.image_to_data(im, lang='eng', output_type=Output.DICT)
  text += " ".join(ocr_dict['text'])
  cleaned_text = clean_text(txt=text)

This gives the following error:

**TypeError: expected str, bytes or os.PathLike object, not FileStorage**

I also tried:

log_file = request.files.get('pdf')
images = convert_from_path(log_file.read())
text = ""
for img in images:
  im = img

  ocr_dict = pytesseract.image_to_data(im, lang='eng', output_type=Output.DICT)
  text += " ".join(ocr_dict['text'])
  cleaned_text = clean_text(txt=text)

This gives the error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pdf2image/pdf2image.py", line 458, in pdfinfo_from_path
    proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
  File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.8/subprocess.py", line 1639, in _execute_child
    self.pid = _posixsubprocess.fork_exec(
ValueError: embedded null byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/usr/local/lib/python3.8/dist-packages/flask_restful/__init__.py", line 467, in wrapper
    resp = resource(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/flask/views.py", line 84, in view
    return current_app.ensure_sync(self.dispatch_request)(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/flask_restful/__init__.py", line 582, in dispatch_request
    resp = meth(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/flask_httpauth.py", line 172, in decorated
    return self.ensure_sync(f)(*args, **kwargs)
  File "/home/ubuntu/Credit_Scoring/API_Script/temp2.py", line 38, in post
    json_text = coi_ocr.get_coi_ocr_text()
  File "/home/ubuntu/Credit_Scoring/API_Script/ocr_script/certificate_of_incorporation/coi_ocr_script_pdf.py", line 51, in get_coi_ocr_text
    text1 = self.extract_text_from_COI()
  File "/home/ubuntu/Credit_Scoring/API_Script/ocr_script/certificate_of_incorporation/coi_ocr_script_pdf.py", line 16, in extract_text_from_COI
    images = convert_from_path(self.fl)
  File "/usr/local/lib/python3.8/dist-packages/pdf2image/pdf2image.py", line 98, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "/usr/local/lib/python3.8/dist-packages/pdf2image/pdf2image.py", line 489, in pdfinfo_from_path
    "Unable to get page count.\n%s" % err.decode("utf8", "ignore")
UnboundLocalError: local variable 'err' referenced before assignment

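OK, it turns out I needed to call convert_from_bytes instead of convert_from_path, since log_file.read() returns the PDF's contents as bytes rather than a file path.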
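A minimal sketch of that fix, assuming the upload is a werkzeug FileStorage (or any object with a read() method). The helper name pdf_upload_to_images and its convert parameter are hypothetical, added only for illustration; convert_from_bytes is the pdf2image function named above.

```python
def pdf_upload_to_images(upload, convert=None):
    """Turn an uploaded PDF into a list of page images.

    `upload` is anything with a .read() method that returns the PDF's bytes,
    such as the FileStorage object from request.files.get('pdf').
    `convert` is a hypothetical hook for swapping in another converter;
    by default pdf2image.convert_from_bytes is used.
    """
    if convert is None:
        # Imported here so the helper can be loaded even where
        # pdf2image/poppler is not installed.
        from pdf2image import convert_from_bytes as convert

    pdf_bytes = upload.read()
    if not pdf_bytes:
        # An empty result usually means no file was sent or the stream
        # was already consumed by an earlier read().
        raise ValueError("uploaded PDF is empty or was already read")

    # Pass the raw bytes, not a path: convert_from_path expects a filename.
    return convert(pdf_bytes)
```

In the flask view this would be called as images = pdf_upload_to_images(request.files.get('pdf')), and the returned images can go into the same pytesseract loop as in the question.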
