Pdfminer, struct.error: requires buffer of x bytes
I'm using Python 3.10 on macOS.
I have this code, which I took from another post and changed slightly:
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
rsrcmgr, laparams = PDFResourceManager(), LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
fp = open("my_pdf", 'rb')
pages = PDFPage.get_pages(fp)
for page in pages:
    interpreter.process_page(page)
    layout = device.get_result()
print("It worked")
However, when I run it on some PDFs, it gives me this error:
Traceback (most recent call last):
  File "MY_DIRECTORY/create_database.py", line 38, in <module>
    interpreter.process_page(page)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 991, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1010, in render_contents
    self.execute(list_value(streams))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1036, in execute
    func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 966, in do_Do
    interpreter.render_contents(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1010, in render_contents
    self.execute(list_value(streams))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1036, in execute
    func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 903, in do_Tj
    self.do_TJ([s])
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 896, in do_TJ
    self.device.render_string(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfdevice.py", line 133, in render_string
    textstate.linematrix = self.render_string_horizontal(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdfdevice.py", line 170, in render_string_horizontal
    for cid in font.decode(obj):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/pdffont.py", line 1174, in decode
    return self.cmap.decode(bytes)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdfminer/cmapdb.py", line 136, in decode
    return struct.unpack(">%dH" % n, code)
struct.error: unpack requires a buffer of 6 bytes
Is the problem in my code, in the pdfminer.six library, or in some of the PDFs? How can I fix it?
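For anyone trying to read the last frame of that traceback: the failing call is struct.unpack(">%dH" % n, code), which needs code to be exactly 2 * n bytes long. Below is a minimal sketch of how that call can raise this exact message, using only the standard struct module. The function name decode_two_byte_codes is made up for illustration, and the way n is computed (len(code) // 2) is my assumption about the surrounding library code, not something the traceback shows.

```python
import struct


def decode_two_byte_codes(code: bytes) -> tuple:
    """Unpack `code` as n big-endian unsigned 16-bit integers.

    Hypothetical stand-in for the failing line in the traceback.
    """
    # Assumption: n is derived from the input length with integer division.
    n = len(code) // 2
    if n == 0:
        return ()
    # ">%dH" % n asks for exactly 2 * n bytes; any other length raises struct.error.
    return struct.unpack(">%dH" % n, code)


if __name__ == "__main__":
    # Even-length input: 6 bytes are read as three 16-bit values.
    print(decode_two_byte_codes(b"\x00\x01\x00\x02\x00\x03"))

    # Odd-length input: 7 bytes give n == 3, so struct expects 6 bytes and raises.
    try:
        decode_two_byte_codes(b"\x00\x01\x00\x02\x00\x03\x00")
    except struct.error as exc:
        print("struct.error:", exc)
```

If that assumption holds, the error would mean a text string in the PDF had an odd number of bytes while the font's CMap expected two-byte codes, which would point at the particular PDFs rather than at the calling code.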
I solved it. For some reason, this part of the code:
rsrcmgr, laparams = PDFResourceManager(), LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
has to go between these two lines:
fp = open("my_pdf", 'rb')
pages = PDFPage.get_pages(fp)
So the final code looks like this:
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
fp = open("my_pdf", 'rb')
rsrcmgr, laparams = PDFResourceManager(), LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
pages = PDFPage.get_pages(fp)
for page in pages:
    interpreter.process_page(page)
    layout = device.get_result()
print("It worked")
If anyone knows why this works, please answer this post. I'd be glad to learn.
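In case the error comes back on other PDFs, here is a rough sketch of a more defensive loop that records the pages raising struct.error and carries on with the rest. The helper name process_pages_tolerantly is hypothetical (it is not part of pdfminer.six), and it assumes the failure is limited to individual pages. It takes the page iterable and the per-page callable as arguments, so it can be tried without a PDF at hand.

```python
import struct
from typing import Callable, Iterable, List, Tuple


def process_pages_tolerantly(
    pages: Iterable,
    process_page: Callable[[object], None],
) -> Tuple[int, List[Tuple[int, str]]]:
    """Run process_page on every page, collecting struct.error failures.

    Hypothetical helper: returns the number of pages processed without
    error and a list of (page_index, error_message) for the failed ones.
    """
    ok_count = 0
    failures: List[Tuple[int, str]] = []
    for index, page in enumerate(pages):
        try:
            process_page(page)
            ok_count += 1
        except struct.error as exc:
            # Record the failing page and keep going instead of aborting.
            failures.append((index, str(exc)))
    return ok_count, failures


if __name__ == "__main__":
    # Stand-in for interpreter.process_page that fails on one "page".
    def fake_process(page: object) -> None:
        if page == "bad":
            raise struct.error("unpack requires a buffer of 6 bytes")

    print(process_pages_tolerantly(["a", "bad", "c"], fake_process))
```

With the code from this post it would be called as process_pages_tolerantly(PDFPage.get_pages(fp), interpreter.process_page). Note that device.get_result() would then have to be called from inside the callable if the layout of each page is needed.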