Extracting the coordinates of each word from a PDF file using pdfminer

Question

I am trying to use pdfminer to extract the coordinates of each word from an input PDF file. I tried the code below.

from pdfminer.layout import LAParams, LTTextBox, LTText, LTChar, LTAnno
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.converter import PDFPageAggregator

fp = open('Input.pdf', 'rb')
manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)
pages = PDFPage.get_pages(fp)
for page in pages:
    interpreter.process_page(page)
    layout = dev.get_result()
    x, y, text = -1, -1, ''
    for textbox in layout:
        if isinstance(textbox, LTText):
          for line in textbox:
            for char in line:
              # If the char is a line-break or an empty space, the word is complete
              if isinstance(char, LTAnno) or char.get_text() == ' ':
                if x != -1:
                    print('%r : %s' % ((x, y), text))
                x, y, text = -1, -1, ''
              elif isinstance(char, LTChar):
                text += char.get_text()
                if x == -1:
                  x, y, = char.bbox[0], char.bbox[3]
    # If the last symbol in the PDF was neither an empty space nor a LTAnno, print the word here
    if x != -1:
      print('At %r : %s' % ((x, y), text))

I can extract the word coordinates from the first page of the input file. After that, running the code above fails with this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-154-a00e7d332dc4> in <module>
     19         if isinstance(textbox, LTText):
     20           for line in textbox:
---> 21             for char in line:
     22               # If the char is a line-break or an empty space, the word is complete
     23               if isinstance(char, LTAnno) or char.get_text() == ' ':

TypeError: 'LTChar' object is not iterable

My questions are:

Why does this error occur?
My input PDF has 24 pages. How can I extract the word coordinates from all of the pages?

Answer 1

As Zach Young commented, I would make sure that the object iterated at line 21 of the traceback is a text line and not an LTChar object:
```
 if isinstance(line, LTTextLineHorizontal):
```
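As for why the error appears: in pdfminer, LTText is a base class shared by text boxes, text lines, LTChar and LTAnno, so `isinstance(textbox, LTText)` also accepts a text line that sits directly in the page layout. Iterating such a line yields characters, and the innermost loop then tries to iterate an LTChar. The traceback is consistent with that, although I have not checked it against the actual PDF. The sketch below reproduces the situation with small stand-in classes (`Text`, `Char`, `TextLine` and `TextBox` are hypothetical names, not pdfminer classes), so it runs without pdfminer or a PDF file.

```python
class Text:
    """Stand-in for pdfminer's LTText base class (hypothetical name)."""


class Char(Text):
    """Stand-in for LTChar: a text item that is not iterable."""

    def __init__(self, text):
        self.text = text


class TextLine(Text):
    """Stand-in for a text line: iterating it yields characters."""

    def __init__(self, chars):
        self.chars = chars

    def __iter__(self):
        return iter(self.chars)


class TextBox(Text):
    """Stand-in for LTTextBox: iterating it yields text lines."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        return iter(self.lines)


def broad_check(layout):
    """Mimic the question's loops: guard only on the shared base class."""
    found = []
    for box in layout:
        if isinstance(box, Text):
            for line in box:
                for char in line:  # TypeError when `line` is already a Char
                    found.append(char.text)
    return found


def narrow_check(layout):
    """Guard on the concrete container types before descending."""
    found = []
    for box in layout:
        if isinstance(box, TextBox):
            for line in box:
                if isinstance(line, TextLine):
                    for char in line:
                        found.append(char.text)
    return found


# A layout with one regular text box and one text line at the top level.
layout = [TextBox([TextLine([Char('a'), Char('b')])]), TextLine([Char(' ')])]
```

With this layout, `narrow_check(layout)` collects the characters of the text box, while `broad_check(layout)` raises a TypeError on the top-level line, which matches the error in the question.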

You can append the coordinates extracted from each page to a list. I would do:

# Needed in addition to the imports from the question
from pdfminer.layout import LTTextLineHorizontal

all_coordinates = []
fp = open('Input.pdf', 'rb')
manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)
pages = PDFPage.get_pages(fp)
for page in pages:
    page_coordinates = []
    interpreter.process_page(page)
    layout = dev.get_result()
    x, y, text = -1, -1, ''
    for textbox in layout:
        if isinstance(textbox, LTTextBox):
            for line in textbox:
                if isinstance(line, LTTextLineHorizontal):
                    for char in line:
                        # A line break or a space means the word is complete
                        if isinstance(char, LTAnno) or char.get_text() == ' ':
                            if x != -1:
                                print('%r : %s' % ((x, y), text))
                            x, y, text = -1, -1, ''
                        elif isinstance(char, LTChar):
                            text += char.get_text()
                            if x == -1:
                                # First character of a new word: record its position
                                x, y = char.bbox[0], char.bbox[3]
                                page_coordinates.append((x, y))
    # Print the last word if the page did not end with a space or an LTAnno
    if x != -1:
        print('At %r : %s' % ((x, y), text))
    all_coordinates.append(page_coordinates)
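
If you want to check the word-splitting step without a PDF at hand, the same logic can be pulled out into a small generator. This is only a sketch under assumptions: `Glyph` is a hypothetical stand-in for an LTChar (its text plus an `(x0, y0, x1, y1)` bounding box), and `None` stands in for an LTAnno. Neither is part of pdfminer.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class Glyph(NamedTuple):
    """Hypothetical stand-in for an LTChar: its text and bounding box."""
    text: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iter_words(
    items: Iterable[Optional[Glyph]],
) -> Iterator[Tuple[Tuple[float, float], str]]:
    """Yield ((x, y), word) pairs; None plays the role of an LTAnno."""
    start = None
    text = ''
    for item in items:
        if item is None or item.text == ' ':
            # A separator closes the current word, if there is one
            if text:
                yield start, text
            start, text = None, ''
        else:
            if start is None:
                # Left edge and top edge of the word's first character
                start = (item.bbox[0], item.bbox[3])
            text += item.text
    # Emit the trailing word when the input does not end with a separator
    if text:
        yield start, text


# Made-up sample data: two words on one line, then one word on the next line
sample = [
    Glyph('H', (10, 700, 16, 712)),
    Glyph('i', (16, 700, 19, 712)),
    Glyph(' ', (19, 700, 22, 712)),
    Glyph('y', (22, 700, 28, 712)),
    None,
    Glyph('o', (10, 680, 16, 692)),
]
words = list(iter_words(sample))
```

Each entry in `words` pairs the top-left corner of a word's first character with the word itself, which is the same information the loop above prints for every page.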

Solution 1
0 2022-07-27 08:49:25
