简体   繁体   中英

Python code to extract txt from PDF document

I have been trying to convert some PDFs into.txt, but most sample codes I found online have the same issue: They only convert one page at a time. I am kinda new to python, and I am not finding how to write a substitute for the.GetPage() method to convert the entire document at once. All help is welcomed.

import PyPDF2
 
pdfFileObject = open(r"F:\pdf.pdf", 'rb')
 
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
 
print(" No. Of Pages :", pdfReader.numPages)
 
pageObject = pdfReader.getPage(0)
 
print(pageObject.extractText())
 
pdfFileObject.close()

You could do this with a for loop. Extract the text from the pages in the loop and append them to a list.

import PyPDF2

pages_text=[]
with open(r"F:\pdf.pdf", 'rb') as pdfFileObject:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

    print(" No. Of Pages :", pdfReader.numPages)
    for page in range(pdfReader.numPages):
        pageObject = pdfReader.getPage(page)
        pages_text.append(pageObject.extractText())

print(pages_text)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM