
Extract the coordinates of each word from PDF file using pdfminer

I am trying to extract the coordinates of each word from an input PDF file using pdfminer. I tried the code below.

from pdfminer.layout import LAParams, LTTextBox, LTText, LTChar, LTAnno
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.converter import PDFPageAggregator

fp = open('Input.pdf', 'rb')
manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)
pages = PDFPage.get_pages(fp)
for page in pages:
    interpreter.process_page(page)
    layout = dev.get_result()
    x, y, text = -1, -1, ''
    for textbox in layout:
        if isinstance(textbox, LTText):
          for line in textbox:
            for char in line:
              # If the char is a line-break or an empty space, the word is complete
              if isinstance(char, LTAnno) or char.get_text() == ' ':
                if x != -1:
                    print('%r : %s' % ((x, y), text))
                x, y, text = -1, -1, ''
              elif isinstance(char, LTChar):
                text += char.get_text()
                if x == -1:
                  x, y, = char.bbox[0], char.bbox[3]
    # If the last symbol in the PDF was neither an empty space nor a LTAnno, print the word here
    if x != -1:
      print('At %r : %s' % ((x, y), text))

I can extract the word coordinates from the first page of the input file. After that, running the code above raises this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-154-a00e7d332dc4> in <module>
     19         if isinstance(textbox, LTText):
     20           for line in textbox:
---> 21             for char in line:
     22               # If the char is a line-break or an empty space, the word is complete
     23               if isinstance(char, LTAnno) or char.get_text() == ' ':

TypeError: 'LTChar' object is not iterable

My questions are:

  1. Why does the error occur?
  2. My input PDF has 24 pages. How can I extract the word coordinates from all of the pages?
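
  1. As Zach Young commented, I would make sure that the object iterated on line 21 (line) is not an LTChar object:

     if isinstance(line, LTTextLineHorizontal):
  2. You can append the coordinates extracted from each page to a list. I would do:

     # Needed in addition to the imports in the question
     from pdfminer.layout import LTTextLineHorizontal

     all_coordinates = []  # one list of (x, y) word positions per page
     fp = open('Input.pdf', 'rb')
     manager = PDFResourceManager()
     laparams = LAParams()
     dev = PDFPageAggregator(manager, laparams=laparams)
     interpreter = PDFPageInterpreter(manager, dev)
     pages = PDFPage.get_pages(fp)
     for page in pages:
         page_coordinates = []
         interpreter.process_page(page)
         layout = dev.get_result()
         x, y, text = -1, -1, ''
         for textbox in layout:
             # Only descend into real text boxes
             if isinstance(textbox, LTTextBox):
                 for line in textbox:
                     # Only iterate over lines, never over single characters
                     if isinstance(line, LTTextLineHorizontal):
                         for char in line:
                             # A line-break or a space completes the word
                             if isinstance(char, LTAnno) or char.get_text() == ' ':
                                 if x != -1:
                                     print('%r : %s' % ((x, y), text))
                                 x, y, text = -1, -1, ''
                             elif isinstance(char, LTChar):
                                 text += char.get_text()
                                 if x == -1:
                                     # First character of a new word: record its position
                                     x, y = char.bbox[0], char.bbox[3]
                                     page_coordinates.append((x, y))
         # Print a word still pending at the end of the page
         if x != -1:
             print('At %r : %s' % ((x, y), text))
         all_coordinates.append(page_coordinates)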
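
On the first question, a likely cause (as far as I can tell from pdfminer's class hierarchy) is that LTText is a mixin shared by LTTextBox, LTTextLine, LTChar and LTAnno. The check isinstance(textbox, LTText) can therefore match a text line sitting directly at page level, and its children are characters, which cannot be iterated. Below is a minimal sketch of the same idea as the answer above, restructured so that each word comes back with a full bounding box instead of being printed. The names group_words and words_per_page are hypothetical and not part of pdfminer; the pdfminer imports sit inside the function only so that the grouping helper can be tried without a PDF at hand.

```python
def group_words(chars):
    """Group (text, bbox) pairs into words.

    `chars` yields (text, bbox) tuples, where bbox is (x0, y0, x1, y1), or
    None for characters inserted by layout analysis (LTAnno has no bbox).
    Returns a list of (word, (x0, y0, x1, y1)) tuples.
    """
    words = []
    text, box = '', None
    for ch, bbox in chars:
        # Whitespace or a layout-inserted character ends the current word
        if bbox is None or not ch.strip():
            if text:
                words.append((text, box))
            text, box = '', None
        else:
            text += ch
            if box is None:
                box = tuple(bbox)
            else:
                # Grow the word's box to cover the new character
                box = (min(box[0], bbox[0]), min(box[1], bbox[1]),
                       max(box[2], bbox[2]), max(box[3], bbox[3]))
    if text:
        words.append((text, box))
    return words


def words_per_page(path):
    """Return one list of (word, bbox) tuples per page of the PDF at `path`."""
    # Same pdfminer classes as in the question, imported lazily
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTChar, LTTextBox, LTTextLine
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    manager = PDFResourceManager()
    device = PDFPageAggregator(manager, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, device)
    result = []
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            page_words = []
            for box in device.get_result():
                if not isinstance(box, LTTextBox):
                    continue  # skip figures, curves and stray text lines
                for line in box:
                    if not isinstance(line, LTTextLine):
                        continue
                    # LTChar carries a bbox; anything else (LTAnno) does not
                    chars = [(c.get_text(),
                              c.bbox if isinstance(c, LTChar) else None)
                             for c in line]
                    page_words.extend(group_words(chars))
            result.append(page_words)
    return result
```

Calling words_per_page('Input.pdf') would then give a list with one entry per page, so result[0] holds the words of the first page. The coordinates are in PDF points with the origin at the bottom-left corner of the page. Words are grouped line by line here, so a word hyphenated across two lines comes back as two entries.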
