I want to extract text from multiple PDF files. The files contain text and some images, and some pages are scanned (I assume scanned pages behave like images). I used the code below to extract the text. How can I modify it to check whether each page contains any images and, if so, extract the text from those images as well? I would appreciate any help.
    import os
    import PyPDF2

    lst_all_text = []
    for foldername, subfolders, files in os.walk(r"C:/MY PATH"):
        for file in files:
            # open the pdf file
            # (PyPDF2 < 3.0 names; 3.x renames these to PdfReader, len(reader.pages), page.extract_text())
            reader = PyPDF2.PdfFileReader(os.path.join(foldername, file))
            # get number of pages
            num_pages = reader.getNumPages()
            text = ""
            # extract the text of every page
            for i in range(num_pages):
                page = reader.getPage(i)
                text += page.extractText()
            # store one string per PDF file
            lst_all_text.append(text)
It has been a while since I have done this, so I will put here the general method I have followed:
Note: The problem I faced with Tesseract was that it became really slow as the number of images increased.
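For the per-page check asked about in the question, here is a minimal sketch rather than a definitive implementation. It assumes PyMuPDF (imported as `fitz`), `pytesseract`, Pillow and the Tesseract binary are installed. This swaps PyMuPDF in for the PyPDF2 calls in the question, because PyMuPDF exposes the images on a page directly. The function names `page_text_with_ocr` and `extract_pdf_text` are made up for illustration. Some image encodings may not open in Pillow and would need extra handling.

```python
from typing import Callable, Iterable


def page_text_with_ocr(embedded_text: str,
                       images: Iterable[bytes],
                       ocr: Callable[[bytes], str]) -> str:
    """Combine a page's embedded text with OCR text from each of its images.

    `ocr` is only called when the page actually has images, so pages that
    contain nothing but text never reach the OCR engine.
    """
    parts = [embedded_text.strip()]
    for blob in images:
        parts.append(ocr(blob).strip())
    # Drop empty pieces (for example a scanned page with no text layer).
    return "\n".join(p for p in parts if p)


def extract_pdf_text(path: str) -> str:
    """Extract text from one PDF, running OCR on the images of each page.

    Assumption: PyMuPDF, pytesseract, Pillow and Tesseract are available.
    The imports are local so the helper above works without them.
    """
    import io
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    def ocr(blob: bytes) -> str:
        # Decode the raw image bytes and hand them to Tesseract.
        return pytesseract.image_to_string(Image.open(io.BytesIO(blob)))

    doc = fitz.open(path)
    try:
        pages = []
        for page in doc:
            # Each entry describes one image on the page; item 0 is its xref.
            xrefs = [info[0] for info in page.get_images(full=True)]
            blobs = [doc.extract_image(xref)["image"] for xref in xrefs]
            pages.append(page_text_with_ocr(page.get_text(), blobs, ocr))
    finally:
        doc.close()
    return "\n".join(pages)
```

In the loop from the question, this would be called as `lst_all_text.append(extract_pdf_text(os.path.join(foldername, file)))`. A scanned page normally shows up as a page with one large image and little or no embedded text, so it takes the OCR path. Tesseract only runs on pages that have images, which should limit the slowdown mentioned in the note above, though it will not remove it for image-heavy files.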