
How to extract text from PDFs that contain both text and images

I want to extract text from multiple PDF files. The files contain text and some images, and some pages are scanned (I assume scanned pages behave like images). I used the code below to extract the text. How can I change it so that it checks whether each page contains any images and, if it does, extracts the text from those images as well? Any help would be appreciated.

import os
import PyPDF2

lst_all_text = []

for foldername, subfolders, files in os.walk(r"C:/MY PATH"):
    for file in files:
        # skip anything that is not a PDF
        if not file.lower().endswith(".pdf"):
            continue
        # open the pdf file (PyPDF2 >= 2.0 names; older releases used
        # PdfFileReader / getNumPages / getPage / extractText)
        reader = PyPDF2.PdfReader(os.path.join(foldername, file))
        text = ""
        # extract the text layer of every page; scanned pages yield little or no text here
        for page in reader.pages:
            text += page.extract_text() or ""

        lst_all_text.append(text)

It has been a while since I did this, so I will only describe the general method I followed:

  1. Use the PyMuPDF library to handle the PDF files; it can extract both the text and the images from them.
  2. After extracting the text from a file, save the extracted images in one directory and store their file names in a list.
  3. To extract text from the images, use the pytesseract library, a Python wrapper for the open-source Tesseract OCR engine. Loop over the images in the list and run Tesseract on each one.
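The steps above can be sketched roughly as follows. This is a minimal sketch, not tested against your files: it assumes PyMuPDF (imported as `fitz`), pytesseract, Pillow, and the Tesseract executable are installed. The function names (`join_parts`, `ocr_page_images`, `extract_pdf_text`, `extract_folder`) are made up for illustration, and for brevity it runs OCR on the images in memory instead of first saving them to a directory as in step 2.

```python
import io
import os


def join_parts(parts):
    """Join text fragments, dropping empty ones and surrounding whitespace."""
    return "\n".join(p.strip() for p in parts if p and p.strip())


def ocr_page_images(doc, page):
    """OCR every image embedded in one page; returns a list of strings."""
    # Third-party imports are kept local so the pure helper above has no dependencies.
    import pytesseract
    from PIL import Image

    texts = []
    image_refs = page.get_images(full=True)
    if image_refs:  # the "does this page contain images?" condition
        for ref in image_refs:
            xref = ref[0]  # first item of each entry is the image's xref number
            data = doc.extract_image(xref)["image"]  # raw image bytes
            # Assumption: Pillow can decode the embedded format; some PDFs use
            # encodings it cannot open, which would need extra handling.
            with Image.open(io.BytesIO(data)) as img:
                texts.append(pytesseract.image_to_string(img))
    return texts


def extract_pdf_text(path):
    """Return the text layer plus OCR'd image text for one PDF."""
    import fitz  # PyMuPDF

    parts = []
    with fitz.open(path) as doc:
        for page in doc:
            parts.append(page.get_text())             # normal text layer
            parts.extend(ocr_page_images(doc, page))  # text found inside images
    return join_parts(parts)


def extract_folder(root):
    """Walk a folder tree and return one text string per PDF found."""
    results = []
    for foldername, _subfolders, files in os.walk(root):
        for name in files:
            if name.lower().endswith(".pdf"):
                results.append(extract_pdf_text(os.path.join(foldername, name)))
    return results
```

Calling `extract_folder(r"C:/MY PATH")` would then play the role of `lst_all_text` in the question. A scanned page is normally stored as one large image, so it goes through the same image branch as any other embedded picture.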

Note: The problem I ran into with Tesseract was that it became very slow as the number of images increased.
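One possible way to soften the slowdown, which I have not benchmarked, is to OCR several images at the same time. pytesseract runs the Tesseract executable as a separate process for each call, so a thread pool should be enough to keep several of them busy. The sketch below assumes the images were already saved to disk; `ocr_many`, `_tesseract_ocr`, and the default of 4 workers are illustrative choices, not part of any library.

```python
from concurrent.futures import ThreadPoolExecutor


def _tesseract_ocr(image_path):
    """OCR a single image file with Tesseract."""
    # Imported lazily so ocr_many can also be used with a different OCR function.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


def ocr_many(image_paths, ocr_func=_tesseract_ocr, max_workers=4):
    """Run ocr_func over image_paths concurrently, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as image_paths
        return list(pool.map(ocr_func, image_paths))
```

For example, `texts = ocr_many(list_of_image_paths)` returns one string per image. How much this helps depends on the number of CPU cores and on the image sizes.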
