
Python module that can remove the OCRed text layer from one pdf file and move it to another?

I have two PDF files that are almost the same, except that the first one has OCRed text and the second doesn't, and they use different compression.

The first file's OCRed text contains some errors, and the file displays that text on top of the corresponding image, so I can't see what the correct text is. The second file, which shows only the image, is how I can check it.

I would like to

  • make the first file show the image, with the OCRed text hidden and not covering the image.

  • Alternatively, move the OCRed text from the first file to the second.

  • Alternatively, remove the OCRed text from the first file and then re-OCR it, since Adobe Acrobat can't re-OCR a PDF file that already contains OCRed text.

So I wonder if there is a Python module that can move the OCRed text layer from the first file to the second, removing it from the first file in the process?

If there isn't one, which other languages have such libraries?

Thanks!

Check out pdfminer; it's not exactly a user-friendly API, but you should be able to navigate the PDF structure and drop the obstructing text. You can come back with specific questions.

But if it's just a question of hiding the OCR, you may be able to hide it if you open the file in Acrobat; IIRC it has options for showing just the OCR, just the background, or both.
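As a rough illustration of "dropping the obstructing text", here is a minimal sketch. It assumes the page content stream has already been parsed into (operands, operator) pairs, and it simply discards everything from a BT (begin text) operator through the matching ET (end text). Note that pdfminer is mainly geared toward reading PDFs, so to write the modified file the sketch swaps in pikepdf and its parse_content_stream / unparse_content_stream helpers; that choice, and the names strip_text_objects and remove_text_layer, are assumptions for illustration, not something from the question.

```python
def strip_text_objects(instructions):
    """Return the instructions with every BT ... ET text object removed.

    `instructions` is a sequence of (operands, operator) pairs, the shape
    produced by a content-stream parser. Operators are compared by their
    string form, so plain strings work as well as library operator objects.
    """
    kept = []
    in_text = False
    for instruction in instructions:
        operands, operator = instruction
        name = str(operator)
        if name == "BT":        # start of a text object: begin skipping
            in_text = True
        elif name == "ET":      # end of a text object: stop skipping
            in_text = False
        elif not in_text:       # anything outside text objects is kept
            kept.append(instruction)
    return kept


def remove_text_layer(src_path, dst_path):
    """Write a copy of src_path with the text objects removed from each page.

    Assumes pikepdf is installed; it is imported lazily so that
    strip_text_objects stays usable without it.
    """
    import pikepdf

    with pikepdf.open(src_path) as pdf:
        for page in pdf.pages:
            instructions = pikepdf.parse_content_stream(page)
            kept = strip_text_objects(instructions)
            page.Contents = pdf.make_stream(pikepdf.unparse_content_stream(kept))
        pdf.save(dst_path)
```

This removes all text on the page, not only the OCR layer, and it does not look inside Form XObjects, so it only fits scanned documents where the page content is essentially an image plus OCRed text. Try it on a copy and compare the result with your second file before relying on it.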
