
How to properly extract Japanese text from PDF files

I need to extract the text from PDF files.

The problem is that some pages of the files are scanned images, so their text can't be retrieved with PyPDF or PDFMiner. For those pages the extracted text is empty.

Could anyone give me a hint on how to proceed?

I don't think there's a quick solution for dealing with Unicode text, especially Japanese.

One approach we could take:

  • Iterate over the pages and determine whether each page is a scanned image or not. This can be done with PyMuPDF; take a look at this answer.
  • If the page is not scanned, we can extract its text from the PDF as usual.
  • If the page is scanned, we can convert it to a .png image with pdf2image, then use pytesseract to extract the data. Here is sample code showing how to read the data from an image.
  • You might need to do some extra data cleanup to get the proper words.
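import cv2
import pytesseract
from pytesseract import Output

# Load the image to run OCR on.
img = cv2.imread('invoice-sample.jpg')

# Run OCR and get per-word boxes, confidences, and text as a dict of lists.
# For Japanese, pass lang='jpn' (requires the Japanese language data for Tesseract).
d = pytesseract.image_to_data(img, output_type=Output.DICT)
print(d.keys())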
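
As a rough illustration of the "extra data work" mentioned above, the dict returned by image_to_data could be regrouped into text lines. This is only a sketch: the helper name words_to_lines, the min_conf threshold, and joining words with an empty string (since Japanese is normally written without spaces) are my own assumptions, not part of the original answer.

```python
from collections import defaultdict


def words_to_lines(d, min_conf=0.0, joiner=""):
    """Group Tesseract word rows into text lines (hypothetical helper).

    `d` is the dict returned by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    lines = defaultdict(list)
    for i, word in enumerate(d["text"]):
        # Skip empty rows (block/paragraph/line entries) and words whose
        # confidence is below the threshold. Confidence values may be
        # strings or numbers depending on the pytesseract version.
        if not word.strip() or float(d["conf"][i]) < min_conf:
            continue
        key = (d["page_num"][i], d["block_num"][i],
               d["par_num"][i], d["line_num"][i])
        lines[key].append(word)
    # Sort by (page, block, paragraph, line) to keep reading order.
    return [joiner.join(words) for _, words in sorted(lines.items())]
```

With the d from the sample above, print("\n".join(words_to_lines(d, min_conf=50))) would print one OCR line per row. The threshold of 50 is arbitrary and would need tuning on real scans.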

For more on Tesseract, see this article.
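
Putting the steps from the list together, a page-by-page pipeline might look roughly like the sketch below. Several things here are assumptions on my part: the "almost no extractable characters" heuristic in needs_ocr is not necessarily what the linked answer does, the function names and the min_chars and dpi defaults are made up for illustration, and the code expects PyMuPDF (a version that has page.get_text()), pdf2image with Poppler, and Tesseract with the Japanese language data (lang="jpn") to be installed.

```python
def needs_ocr(text, min_chars=10):
    """Heuristic (an assumption): a page whose text layer has almost no
    non-whitespace characters is treated as a scanned image."""
    return len("".join(text.split())) < min_chars


def extract_pdf_text(pdf_path, lang="jpn", dpi=300, min_chars=10):
    """Return one string per page, running OCR only on pages without text."""
    # Third-party imports are local so needs_ocr() works without them.
    import fitz  # PyMuPDF
    import pytesseract
    from pdf2image import convert_from_path

    results = []
    doc = fitz.open(pdf_path)
    try:
        for index, page in enumerate(doc):
            text = page.get_text()
            if needs_ocr(text, min_chars):
                # pdf2image page numbers are 1-based.
                image = convert_from_path(
                    pdf_path, dpi=dpi,
                    first_page=index + 1, last_page=index + 1,
                )[0]
                text = pytesseract.image_to_string(image, lang=lang)
            results.append(text)
    finally:
        doc.close()
    return results
```

Calling extract_pdf_text("input.pdf") (a placeholder path) would give a list with one entry per page. For vertically written Japanese, Tesseract's "jpn_vert" language data may work better than "jpn".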
