I am trying to perform OCR using Tesseract OCR on multiple large PDF files (~400-600 pages). I don't need to extract text from every page, only from a few pages whose numbers are known. The PDF file seems to have had some sort of OCR performed on it already, but it isn't a good job. When I run this code that I wrote in Jupyter:
import pdf2image
from PIL import Image
import pytesseract
import cv2
import numpy as np
pytesseract.pytesseract.tesseract_cmd = r"C:/Program Files/Tesseract-OCR/tesseract.exe"
images = pdf2image.convert_from_path("test2.pdf", first_page=3, last_page=3, poppler_path=r"C:/Program Files/poppler-0.68.0/bin")
images[0].show()
I see this output:
This is what the output should look like:
I do think that the OCR that was done on the PDF is causing some problems here. I am not sure how to bypass it, can someone please help?
I also tried OCR by manually converting the page into an image (snipping tool), and the OCR engine worked. I also tried playing with the options on pdf2image.convert_from_path(), such as omitting the poppler_path option or requesting other pages. I tried reading another PDF file, which did not have OCR performed on it, and it seemed to work.
I had the same issue. Since I was unable to fix it, I decided to go with another library.
With the help of another Stack Overflow post and some Googling, I was able to modify Mohit Chandel's function to convert a multi-page PDF into JPGs:
import ghostscript
import locale

def pdf2jpeg(pdf_input_path, jpeg_output_path):
    """
    Source: https://stackoverflow.com/questions/60701262/convert-pdf-to-image-using-python,
    https://www.kite.com/python/answers/how-to-remove-everything-after-a-character-in-a-string-in-python,
    https://www.ghostscript.com/doc/current/Use.htm
    """
    args = ["pdf2jpeg",  # program name placeholder; actual value doesn't matter
            "-dNOPAUSE",
            "-sDEVICE=jpeg",
            "-r144",  # render at 144 dpi
            # one output file per page: <name>-1.jpg, <name>-2.jpg, ...
            "-sOutputFile=" + jpeg_output_path.split(".", 1)[0] + "-%d.jpg",
            pdf_input_path]
    # Ghostscript expects byte strings in the platform's preferred encoding
    encoding = locale.getpreferredencoding()
    args = [a.encode(encoding) for a in args]
    ghostscript.Ghostscript(*args)
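Since the question only needs a few known pages, here is a minimal sketch of how the same approach could be limited to a page range. It assumes Ghostscript's -dFirstPage, -dLastPage and -dBATCH switches; the helper names build_gs_args and pdf_pages_to_jpeg are my own and not part of any library, and I have not tested this against the PDF in the question.

```python
import locale
import os


def build_gs_args(pdf_input_path, jpeg_output_path, first_page, last_page, dpi=144):
    """Build Ghostscript arguments that render only first_page..last_page.

    Hypothetical helper for illustration; not part of the ghostscript package.
    """
    # Strip the extension so "-%d.jpg" can be appended for per-page output files
    base = os.path.splitext(jpeg_output_path)[0]
    return [
        "pdf2jpeg",  # program name placeholder; actual value doesn't matter
        "-dNOPAUSE",
        "-dBATCH",  # exit after the last page instead of waiting for input
        "-sDEVICE=jpeg",
        f"-r{dpi}",
        f"-dFirstPage={first_page}",
        f"-dLastPage={last_page}",
        f"-sOutputFile={base}-%d.jpg",
        pdf_input_path,
    ]


def pdf_pages_to_jpeg(pdf_input_path, jpeg_output_path, first_page, last_page):
    """Render the selected pages to JPEG files via the ghostscript package."""
    # Imported here so build_gs_args can be used without ghostscript installed
    import ghostscript

    encoding = locale.getpreferredencoding()
    args = build_gs_args(pdf_input_path, jpeg_output_path, first_page, last_page)
    ghostscript.Ghostscript(*[a.encode(encoding) for a in args])
```

For example, pdf_pages_to_jpeg("test2.pdf", "page.jpg", 3, 3) should render only page 3. I used os.path.splitext instead of split(".", 1) so that a dot elsewhere in the path does not truncate the output name.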
There is nothing wrong with the source OCR. In fact, it is better than most similar examples. True, there is a glitch here and there, but that is due to the source quality and thus to be expected, and I suspect a second pass would fare much worse.
Here is the existing OCR (which is readable as searchable text), represented as an image. You suggest you want to run OCR on it a second time, but that can only make it worse, never better, unless you manually type in any characters that are missing or malformed.
And here it is as text exported to WordPad:
First Edition, 5,000 Copies, November 1972
© The Navajivan Trust, 1972
Principal collaborators:
Shankar Prasada, ics (retd.)
Special Secretary, Kashmir Affairs (1958-65)
Chief Commissioner of Delhi (1948-54)
B. L. Sharma
Former Principal Information Officer, Government of India,
Former Special Officer on Kashmir Affairs in the External Affairs
Ministry, New Delhi, and author
Inder Jit
Director-Editor, India News and Feature Alliance and
Editor, The States, New Delhi
Trevor Drieberg
Political Commentator and Feature Writer
Former News Editor, The Indian Express, New Delhi
Uggar Sain
Former News Editor and Assistant Editor,
The Hindustan Times, New Delhi
Printed and Published by Shantilal Harjivan Shah
Navajivan Press, Ahmedabad-14
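If the existing text layer is good enough, as the export above suggests, one option is to read it directly for the known pages instead of running OCR again. Below is a minimal sketch that assumes the pdftotext tool shipped with Poppler (which the question already has installed) is on the PATH, and uses its -f, -l and -layout options. The function names are my own, chosen for illustration.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, first_page, last_page, pdftotext="pdftotext"):
    """Build a pdftotext command that writes the selected pages to stdout."""
    return [
        pdftotext,  # assumed to be on PATH; pass a full path otherwise
        "-f", str(first_page),  # first page to extract
        "-l", str(last_page),   # last page to extract
        "-layout",              # try to preserve the physical page layout
        pdf_path,
        "-",                    # "-" sends the text to stdout instead of a file
    ]


def extract_text_layer(pdf_path, first_page, last_page, pdftotext="pdftotext"):
    """Return the existing text layer of the selected pages as a string."""
    cmd = build_pdftotext_cmd(pdf_path, first_page, last_page, pdftotext)
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

For the page in the question this would be called as extract_text_layer("test2.pdf", 3, 3). On Windows the pdftotext argument would probably need to point at the executable inside the Poppler bin folder.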