
How to extract tabular data from a PDF properly when a row's data is split across two separate pages?

Original post 2020-12-19 16:04:02 3 1 python / apache-tika / pdftotext

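My task is to parse tabular data from a PDF. I am using the "tika" library in Python, which works well, but there is one problem:

The PDF contains text in a tabular format, and one row is cut in half so that it ends on the second page. This splits the row's key and value data across two different pages, and I think tika treats that row as two separate rows.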


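In the output, the value ends up in the middle of the key, which is incorrect.

For example:

str = "This is long key dataxxxxxxxValuexxxxxxxxxremaining key data"

Any suggestions?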
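To make the symptom concrete, here is a minimal sketch of how reading the cells page by page would put the value between the two halves of the key. The cell contents and the helper names `linearize` and `merge_continuation` are hypothetical; this is not actual tika output, only an illustration of the ordering problem described above.

```python
# Hypothetical cell contents; the real PDF and tika output are not shown in the post.
page_1_row = {"key": "This is long key data", "value": "Value"}
page_2_row = {"key": "remaining key data", "value": ""}  # continuation of the same row


def linearize(rows):
    """Concatenate non-empty cells row by row, as plain text extraction would read them."""
    return " ".join(
        cell for row in rows for cell in (row["key"], row["value"]) if cell
    )


def merge_continuation(first, second):
    """Join a row with its continuation, column by column."""
    return {
        name: " ".join(part for part in (first[name], second[name]) if part)
        for name in first
    }


# Treating the two halves as separate rows interleaves the value into the key.
extracted = linearize([page_1_row, page_2_row])

# Treating them as one logical row keeps the key together.
intended = linearize([merge_continuation(page_1_row, page_2_row)])

print(extracted)
print(intended)
```

The first string has the same shape as the example above; the second is what a single unbroken row would give.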

1 Answer

You could try using a different Tesseract page segmentation mode (psm): Pytesseract OCR multiple config options

To set a different psm in tika (1 is the default value), you can either use the header X-Tika-OCRPageSegMode: xx or use the Tesseract config: https://tika.apache.org/1.24/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setPageSegMode-java.lang.String-
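A minimal sketch of the header approach with the tika Python package, assuming its parser.from_file function is called with the headers argument so the value is forwarded to the Tika server. The helper names `build_ocr_headers` and `extract_text` are hypothetical, and the mode value 6 (a single uniform block of text) is only an example to experiment with, not a value recommended by the answer. The setting only has an effect when Tika actually runs Tesseract OCR on the document.

```python
def build_ocr_headers(page_seg_mode: int) -> dict:
    """Return request headers asking Tika's Tesseract parser for a given psm."""
    # Tesseract defines page segmentation modes 0 through 13.
    if not 0 <= page_seg_mode <= 13:
        raise ValueError("page segmentation mode must be between 0 and 13")
    return {"X-Tika-OCRPageSegMode": str(page_seg_mode)}


def extract_text(pdf_path: str, page_seg_mode: int = 6) -> str:
    """Parse a PDF through Tika with the chosen psm and return its text."""
    # Imported lazily so build_ocr_headers works without tika installed.
    from tika import parser  # tika-python; needs Java or a running Tika server

    parsed = parser.from_file(pdf_path, headers=build_ocr_headers(page_seg_mode))
    # "content" can be None when nothing was extracted.
    return parsed.get("content") or ""


if __name__ == "__main__":
    # Only the header is printed here; call extract_text("your.pdf", 6)
    # yourself to compare the output for different modes.
    print(build_ocr_headers(6))
```

Trying a few modes on the problem page and comparing where the value lands relative to the key is probably the quickest way to see whether this helps for a given PDF.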
