What are the strategies for converting hOCR output to a string (for use with regular expressions)?

Question

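I am using Pytesseract and want to convert its hOCR output to a string. Pytesseract already implements this, of course, but I would like to learn more about the possible strategies for doing it. Thanks.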

from pytesseract import image_to_pdf_or_hocr
hocr_output = image_to_pdf_or_hocr(image, extension='hocr')

Answer 1

Since hOCR is XML-based (Tesseract writes it as XHTML), we can use an XML parser.

But first we need to convert Tesseract's output, which is returned as bytes, to str:

from pytesseract import image_to_pdf_or_hocr

hocr_output = image_to_pdf_or_hocr(image, extension='hocr')
hocr = hocr_output.decode('utf-8')

Now we can parse it with xml.etree:

import xml.etree.ElementTree as ET

root = ET.fromstring(hocr)
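
One thing to be aware of when searching the parsed tree: the hOCR root sits in the XHTML namespace, so ElementTree reports tag names as {namespace}tag. A small sketch of what that looks like, using a hand-written one-word sample rather than real Tesseract output (the names sample, sample_root and XHTML_NS are only for illustration):

```python
import xml.etree.ElementTree as ET

# Prefix mapping for XPath-style lookups; "x" is an arbitrary prefix.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hand-written stand-in for a tiny piece of hOCR markup.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<span class="ocrx_word">Hello</span>'
    '</body></html>'
)

sample_root = ET.fromstring(sample)

# The tag carries the namespace in braces, not just "html".
print(sample_root.tag)

# A plain ".//span" would find nothing here; the prefix is needed.
spans = sample_root.findall(".//x:span[@class='ocrx_word']", XHTML_NS)
print([s.text for s in spans])
```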

xml.etree gives us a text iterator, whose results we can join into a single string:

text = ''.join(root.itertext())
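
Note that itertext() also yields the text of the title element and all the whitespace between tags, so the joined string can be noisy for regex work. An alternative is to collect only the word elements, one output line per OCR line. This is a minimal sketch, assuming the usual hOCR layout where lines are elements with class "ocr_line" and words are elements with class "ocrx_word" (some Tesseract versions may use additional line-level classes). SAMPLE_HOCR is a hand-written stand-in for real output, and has_class and hocr_to_lines are hypothetical helper names:

```python
import re
import xml.etree.ElementTree as ET

# Hand-written sample standing in for real Tesseract hOCR output.
SAMPLE_HOCR = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
 <head><title>sample</title></head>
 <body>
  <div class="ocr_page" id="page_1">
   <span class="ocr_line" id="line_1_1">
    <span class="ocrx_word" id="word_1_1">Invoice</span>
    <span class="ocrx_word" id="word_1_2">No.</span>
    <span class="ocrx_word" id="word_1_3">12345</span>
   </span>
   <span class="ocr_line" id="line_1_2">
    <span class="ocrx_word" id="word_1_4">Total:</span>
    <span class="ocrx_word" id="word_1_5"><strong>99.50</strong></span>
   </span>
  </div>
 </body>
</html>"""


def has_class(elem, name):
    """Return True if the element's class attribute contains `name`."""
    return name in elem.get("class", "").split()


def hocr_to_lines(hocr):
    """Return one string per OCR line, with words separated by single spaces."""
    root = ET.fromstring(hocr)
    lines = []
    for elem in root.iter():
        if not has_class(elem, "ocr_line"):
            continue
        # itertext() on each word also picks up text nested in <strong>, <em>, etc.
        words = [
            "".join(word.itertext()).strip()
            for word in elem.iter()
            if has_class(word, "ocrx_word")
        ]
        lines.append(" ".join(w for w in words if w))
    return lines


text = "\n".join(hocr_to_lines(SAMPLE_HOCR))
match = re.search(r"No\.\s+(\d+)", text)
print(match.group(1))
```

Matching on the class attribute instead of the tag name sidesteps the namespace prefix, and keeping one string per line lets line-anchored patterns (with re.MULTILINE) work as expected.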