How to properly extract Japanese text from PDF files
I need to extract the text from PDF files.
The problem is that some pages are scanned images, so PyPDF and PDFMiner cannot retrieve any text from them and the result is empty.
Could anyone give me a hint on how to handle this?
I don't think there's a quick solution for handling Unicode text, especially Japanese.
One approach is to run OCR on the scanned pages with Tesseract:
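import cv2
import pytesseract
from pytesseract import Output

# Load a scanned page that has been saved as an image file.
img = cv2.imread('invoice-sample.jpg')
# lang='jpn' assumes the Japanese language data for Tesseract is installed;
# without it Tesseract falls back to English and Japanese text comes out garbled.
d = pytesseract.image_to_data(img, lang='jpn', output_type=Output.DICT)
# The result holds parallel lists such as 'text', 'conf', 'left', 'top'.
print(d.keys())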
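The dictionary returned by image_to_data holds parallel lists with one entry per detected element, so the recognised words still have to be stitched back together. Below is a minimal sketch of doing that, assuming a single image and the usual key names (text, conf, block_num, par_num, line_num). The helper name lines_from_data and the min_conf threshold are hypothetical, chosen only for illustration, and words are joined without spaces because that tends to suit Japanese.

```python
from itertools import groupby


def lines_from_data(d, min_conf=0.0, sep=""):
    """Rebuild text lines from a pytesseract image_to_data dict.

    sep="" joins words without spaces (reasonable for Japanese);
    min_conf is an illustrative threshold, not a recommended value.
    """
    rows = []
    for i, word in enumerate(d["text"]):
        # conf may arrive as int, float, or str depending on the version;
        # -1 marks non-word entries (blocks, paragraphs, lines).
        conf = float(d["conf"][i])
        if conf < min_conf or not word.strip():
            continue
        key = (d["block_num"][i], d["par_num"][i], d["line_num"][i])
        rows.append((key, word))
    # Entries arrive in reading order, so consecutive equal keys form one line.
    return [
        sep.join(word for _, word in group)
        for _, group in groupby(rows, key=lambda row: row[0])
    ]
```

With the d from the snippet above, print("\n".join(lines_from_data(d, min_conf=50))) would print one recognised line per row while dropping low-confidence fragments.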
For more on Tesseract, see this article.
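Since only some pages are scanned, one option is to keep the embedded text where it exists and fall back to OCR for the empty pages. The sketch below is one way this might look, not a definitive implementation. It assumes the pypdf and pdf2image packages (neither is used in the snippet above), a Poppler install for pdf2image, and Tesseract with the Japanese language data. The function names needs_ocr and extract_pdf_text and the dpi value are hypothetical.

```python
def needs_ocr(text, min_chars=1):
    """True when a page's embedded text layer is effectively empty."""
    return len("".join(text.split())) < min_chars


def extract_pdf_text(pdf_path, lang="jpn", dpi=300):
    """Return one string per page, using OCR only where no text layer exists."""
    # Third-party imports are local so needs_ocr stays dependency-free.
    from pypdf import PdfReader
    from pdf2image import convert_from_path
    import pytesseract

    pages = []
    reader = PdfReader(pdf_path)
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if needs_ocr(text):
            # Rasterise just this page (pdf2image page numbers are 1-based).
            image = convert_from_path(
                pdf_path, dpi=dpi, first_page=number, last_page=number
            )[0]
            text = pytesseract.image_to_string(image, lang=lang)
        pages.append(text)
    return pages
```

Calling extract_pdf_text on a mixed file would then give a list with one entry per page, regardless of whether the page was scanned.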