Extract the coordinates of each word from a PDF file using pdfminer
I am trying to extract the coordinates of each word from an input PDF file using pdfminer. I tried the code below.
from pdfminer.layout import LAParams, LTTextBox, LTText, LTChar, LTAnno
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.converter import PDFPageAggregator

fp = open('Input.pdf', 'rb')
manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)
pages = PDFPage.get_pages(fp)

for page in pages:
    interpreter.process_page(page)
    layout = dev.get_result()
    x, y, text = -1, -1, ''
    for textbox in layout:
        if isinstance(textbox, LTText):
            for line in textbox:
                for char in line:
                    # If the char is a line-break or an empty space, the word is complete
                    if isinstance(char, LTAnno) or char.get_text() == ' ':
                        if x != -1:
                            print('%r : %s' % ((x, y), text))
                        x, y, text = -1, -1, ''
                    elif isinstance(char, LTChar):
                        text += char.get_text()
                        if x == -1:
                            x, y = char.bbox[0], char.bbox[3]
    # If the last symbol in the PDF was neither an empty space nor a LTAnno, print the word here
    if x != -1:
        print('At %r : %s' % ((x, y), text))
I can extract the coordinates of the words on the first page of the input file. After that, running the code above fails with this error:
TypeError                                 Traceback (most recent call last)
<ipython-input-154-a00e7d332dc4> in <module>
     19         if isinstance(textbox, LTText):
     20             for line in textbox:
---> 21                 for char in line:
     22                     # If the char is a line-break or an empty space, the word is complete
     23                     if isinstance(char, LTAnno) or char.get_text() == ' ':

TypeError: 'LTChar' object is not iterable
My question is:
As Zach Young commented, I would make sure that `line` on line 21 is not an LTChar object:
if isinstance(line, LTTextLineHorizontal):
You can append the coordinates extracted from each page to a list. I would do:
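# LTTextLineHorizontal is not in the question's imports, so import it here
from pdfminer.layout import LTTextLineHorizontal

all_coordinates = []  # one list of (x, y) tuples per page
fp = open('Input.pdf', 'rb')
manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)
pages = PDFPage.get_pages(fp)

for page in pages:
    page_coordinates = []
    interpreter.process_page(page)
    layout = dev.get_result()
    x, y, text = -1, -1, ''
    for textbox in layout:
        if isinstance(textbox, LTTextBox):
            for line in textbox:
                # Only iterate over real text lines, never over a bare LTChar
                if isinstance(line, LTTextLineHorizontal):
                    for char in line:
                        # A line-break (LTAnno) or a space ends the current word
                        if isinstance(char, LTAnno) or char.get_text() == ' ':
                            if x != -1:
                                print('%r : %s' % ((x, y), text))
                            x, y, text = -1, -1, ''
                        elif isinstance(char, LTChar):
                            text += char.get_text()
                            if x == -1:
                                # First character of a word: record its top-left corner
                                x, y = char.bbox[0], char.bbox[3]
                                page_coordinates.append((x, y))
    # Print the last word of the page if it was not followed by a space or LTAnno
    if x != -1:
        print('At %r : %s' % ((x, y), text))
    all_coordinates.append(page_coordinates)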
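The word-splitting logic can also be separated from pdfminer so that it is easy to check on synthetic input. The sketch below is only an illustration: `group_chars_into_words` and the `sample` data are hypothetical names, not part of pdfminer, and a `None` bounding box is assumed to stand in for an LTAnno. Instead of keeping only the first character's corner, it merges the character boxes into one `(x0, y0, x1, y1)` box per word.

```python
from typing import Iterable, List, Optional, Tuple

# (x0, y0, x1, y1), the same ordering pdfminer uses for bbox
BBox = Tuple[float, float, float, float]


def group_chars_into_words(
    chars: Iterable[Tuple[str, Optional[BBox]]]
) -> List[Tuple[str, BBox]]:
    """Group (text, bbox) pairs into (word, bbox) pairs.

    Hypothetical helper: a pair whose bbox is None (standing in for an
    LTAnno) or whose text is whitespace ends the current word.
    """
    words: List[Tuple[str, BBox]] = []
    text, box = '', None
    for ch, bbox in chars:
        if bbox is None or ch.isspace():
            # Separator: flush the word collected so far, if any
            if text:
                words.append((text, box))
            text, box = '', None
        else:
            text += ch
            if box is None:
                box = bbox
            else:
                # Grow the word box so it covers the new character
                box = (min(box[0], bbox[0]), min(box[1], bbox[1]),
                       max(box[2], bbox[2]), max(box[3], bbox[3]))
    # Flush the last word if the input did not end with a separator
    if text:
        words.append((text, box))
    return words


# Synthetic characters, made up for illustration only
sample = [
    ('H', (10.0, 700.0, 16.0, 712.0)),
    ('i', (16.0, 700.0, 19.0, 712.0)),
    (' ', (19.0, 700.0, 22.0, 712.0)),
    ('a', (22.0, 700.0, 28.0, 712.0)),
    ('l', (28.0, 700.0, 31.0, 712.0)),
    ('l', (31.0, 700.0, 34.0, 712.0)),
    ('\n', None),
]
words = group_chars_into_words(sample)
for word, bbox in words:
    print(word, bbox)
```

With pdfminer, the inner loop could feed this helper `(char.get_text(), char.bbox)` for each LTChar and `(char.get_text(), None)` for each LTAnno, one text line at a time.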