How to properly extract Japanese text from PDF files

Original post 2022-02-22 16:19:04 7 1 python/ algorithm

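I need to extract text from PDF files.

The problem is that some pages of the files are scanned PDFs, so PyPDF and PDFMiner cannot retrieve any text from them and the result comes back empty.

Can anyone tell me how to handle this?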

1 answer

I don't think there is a quick solution for handling Unicode, especially Japanese.

One solution we could go with:

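Iterate over the pages and determine whether each page is a scanned PDF. This can be done with PyMuPDF; see this answer.
If the page is not a scanned PDF, we can extract the text from it as usual.
For the pages that are scanned, we can use pdf2image to convert the page to a .png image and then use pytesseract to extract the data. The sample code here shows how to read data from an image.
You may need to do some extra data work to get the right words.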
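
The steps above can be sketched roughly as follows. This is a minimal sketch, not the answer's own code: the helper names (looks_scanned, ocr_page, extract_pdf_text), the 20-character threshold, and the 300 dpi setting are all assumptions made for illustration. It also assumes PyMuPDF, pdf2image (with Poppler), pytesseract, and Tesseract's Japanese language data are installed.

```python
from typing import List


def looks_scanned(text: str, image_count: int, min_chars: int = 20) -> bool:
    """Heuristic (an assumption, not a PyMuPDF feature): a page with almost
    no extractable text but at least one embedded image is probably a scan."""
    return len(text.strip()) < min_chars and image_count > 0


def ocr_page(pdf_path: str, page_number: int, lang: str = "jpn", dpi: int = 300) -> str:
    """Render one page (1-based) to an image and run Tesseract on it."""
    # Imported lazily so the heuristic above works without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    return pytesseract.image_to_string(images[0], lang=lang)


def extract_pdf_text(pdf_path: str, lang: str = "jpn") -> List[str]:
    """Return one string per page, using OCR only where a page looks scanned."""
    import fitz  # PyMuPDF

    pages = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            text = page.get_text()
            if looks_scanned(text, len(page.get_images(full=True))):
                text = ocr_page(pdf_path, index + 1, lang=lang)
            pages.append(text)
    return pages
```

Calling extract_pdf_text('some-file.pdf') (a placeholder path) would return a list with one entry per page. The threshold probably needs tuning: a scanned page that already carries a hidden OCR text layer will not be flagged by this heuristic.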

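# Sample: read OCR data from an image with pytesseract
import cv2
import pytesseract
from pytesseract import Output

# Load the page image (OpenCV returns a NumPy array, or None if the file cannot be read)
img = cv2.imread('invoice-sample.jpg')

# Run Tesseract and collect the per-word results as a dict of lists.
# For Japanese, pass lang='jpn' (requires the Japanese language data to be installed).
d = pytesseract.image_to_data(img, output_type=Output.DICT)
print(d.keys())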
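
As one possible form of the "extra data work" mentioned above, the dict returned by image_to_data can be regrouped into lines. The sketch below is an assumption about what that cleanup might look like; lines_from_ocr_data, min_conf, and joiner are hypothetical names. Words are joined with an empty string by default because Tesseract often splits Japanese text into separate "words" even though the language does not use spaces.

```python
from collections import defaultdict
from typing import Dict, List


def lines_from_ocr_data(
    data: Dict[str, list], min_conf: float = 0.0, joiner: str = ""
) -> List[str]:
    """Rebuild text lines from image_to_data(..., output_type=Output.DICT)."""
    grouped = defaultdict(list)
    for i, word in enumerate(data["text"]):
        # Structural rows (page, block, paragraph, line) have empty text and
        # conf -1; conf may arrive as int, float, or str depending on version.
        if not word.strip() or float(data["conf"][i]) < min_conf:
            continue
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        grouped[key].append(word)
    # Keys are unique, so sorting the items orders lines by their position.
    return [joiner.join(words) for _, words in sorted(grouped.items())]
```

With the d from the sample above, lines_from_ocr_data(d, min_conf=50) would drop low-confidence fragments and return one string per detected line. The cutoff of 50 is arbitrary and worth checking against real pages.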

You can find more about tesseract in this article.



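Notice: The technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.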
