
Pdfminer, struct.error: requires buffer of x bytes

I'm using Python 3.10 on macOS.

I have this code, which I took from another post and changed slightly:

from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator


rsrcmgr, laparams = PDFResourceManager(), LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)


fp = open("my_pdf", 'rb')
pages = PDFPage.get_pages(fp)
for page in pages:
    interpreter.process_page(page)
    layout = device.get_result()
    print("It worked")

However, when I run it on some PDFs, it gives me this error:

Traceback (most recent call last):
  File "MY_DIRECTORY/create_database.py", line 38, in <module>
    interpreter.process_page(page)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 991, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1010, in render_contents
    self.execute(list_value(streams))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1036, in execute
    func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 966, in do_Do
    interpreter.render_contents(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1010, in render_contents
    self.execute(list_value(streams))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1036, in execute
    func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 903, in do_Tj
    self.do_TJ([s])
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 896, in do_TJ
    self.device.render_string(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfdevice.py", line 133, in render_string
    textstate.linematrix = self.render_string_horizontal(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfdevice.py", line 170, in render_string_horizontal
    for cid in font.decode(obj):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdffont.py", line 1174, in decode
    return self.cmap.decode(bytes)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/cmapdb.py", line 136, in decode
    return struct.unpack(">%dH" % n, code)
struct.error: unpack requires a buffer of 6 bytes

Is the problem in my code, in the pdfminer.six library, or in some of the PDFs? How can I fix it?
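The last frame of the traceback is a plain struct.unpack call, so the error message itself can be reproduced with the standard library alone. The sketch below is not pdfminer code: decode_identity and decode_identity_lenient are hypothetical names, and deriving n as len(code) // 2 is an assumption about how the library computes it. Under that assumption, any byte string of odd length (for example 7 bytes) asks for a buffer of 6 bytes and fails in the same way.

```python
import struct


def decode_identity(code: bytes) -> tuple:
    """Mimic the failing line: unpack `code` as big-endian 16-bit values."""
    n = len(code) // 2  # assumption: how n is derived from the byte string
    if n == 0:
        return ()
    # struct requires the buffer to be exactly 2 * n bytes long.
    return struct.unpack(">%dH" % n, code)


def decode_identity_lenient(code: bytes) -> tuple:
    """Hypothetical variant that ignores a trailing odd byte."""
    n = len(code) // 2
    return struct.unpack(">%dH" % n, code[: 2 * n]) if n else ()


# Even length: three 16-bit values decode cleanly.
print(decode_identity(b"\x00\x41\x00\x42\x00\x43"))

# Odd length (7 bytes): n is 3, so struct wants exactly 6 bytes and raises.
try:
    decode_identity(b"\x00\x41\x00\x42\x00\x43\x44")
except struct.error as exc:
    print("struct.error:", exc)
```

If this is what happens, the trigger would be a string in the PDF whose length is odd while the font expects two-byte codes, which would point at the PDF content or at the library's handling of it rather than at the calling code.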

I solved it. For some reason, this part of the code:

rsrcmgr, laparams = PDFResourceManager(), LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

has to go between these two lines:

fp = open("my_pdf", 'rb')
pages = PDFPage.get_pages(fp)

So the final code looks like this:

from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator


fp = open("my_pdf", 'rb')
rsrcmgr, laparams = PDFResourceManager(), LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
pages = PDFPage.get_pages(fp)


for page in pages:
    interpreter.process_page(page)
    layout = device.get_result()
    print("It worked")

If anyone knows why this works, please answer this post. I'd be glad to learn.
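To check whether the error is tied to particular pages, one option is to catch struct.error per page instead of letting the first failure stop the whole loop. This is a minimal sketch, not a confirmed fix: process_pages_safely and layouts_from_pdf are hypothetical helper names, and the pdfminer calls are only the ones already used in the post. The pdfminer imports sit inside layouts_from_pdf so the generic helper can be tried without the library installed.

```python
import struct
from typing import Callable, Iterable, List, Tuple


def process_pages_safely(
    pages: Iterable[object],
    process_one: Callable[[object], object],
) -> Tuple[List[object], List[Tuple[int, str]]]:
    """Run process_one on each page; collect results and per-page failures."""
    results: List[object] = []
    failures: List[Tuple[int, str]] = []
    for number, page in enumerate(pages, start=1):
        try:
            results.append(process_one(page))
        except struct.error as exc:
            # Record the 1-based page number and message, then keep going.
            failures.append((number, str(exc)))
    return results, failures


def layouts_from_pdf(path: str):
    """Wire the helper to pdfminer.six, using the calls from the post."""
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    def process_one(page):
        interpreter.process_page(page)
        return device.get_result()

    # The with block closes the file even if a page fails.
    with open(path, "rb") as fp:
        return process_pages_safely(PDFPage.get_pages(fp), process_one)
```

Calling layouts_from_pdf("my_pdf") would return the layouts that were produced plus a list of (page number, message) pairs for pages that raised struct.error, which helps tell a few problematic pages apart from a problem with the whole file.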

