full_text = ''
i = 0
while i < pdf_reader.getNumPages():
    pageinfo = pdf_reader.getPage(i)
    text += str(pageinfo.extractText())
    print(pageinfo.extractText())
    i = i + 1
I am trying to extract all the text from a PDF file. I can print the text of every page with a loop, but I want to save the loop's output in a variable so I can manipulate it. After I saved the output as all_results, I could not do anything with it afterward. For instance, when I check the length of the text, the result is 0.
If I'm understanding the question correctly, you want to merge all the text into a single variable to be used after the loop is done.
Try this code:
all_pg_text = ''
all_results = 0
num_of_pages = pdf_reader.getNumPages()  # total number of pages in the PDF
for i in range(0, num_of_pages):
    print("Page Number: " + str(i))
    print("- - - - - - - - - - - - - - - - - - - -")
    pageObj = pdf_reader.getPage(i)
    pg_text = pageObj.extractText()
    print(pg_text)  # one page
    all_pg_text += pg_text  # add to full text
    print("- - - - - - - - - - - - - - - - - - - -")
    all_results += i
pdfFile.close()  # assumes pdfFile is the open file object passed to the reader
print(all_pg_text)
Based on your updated question, this might work:
full_text = ''
i = 0
while i < pdf_reader.getNumPages():
    pageinfo = pdf_reader.getPage(i)
    full_text += str(pageinfo.extractText())
    print(pageinfo.extractText())
    i = i + 1
print(full_text)
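Note that the loop in the question initialises full_text but appends to text, so any check made on full_text (or all_results) would not see the extracted pages. Here is a minimal sketch of the same idea wrapped in a function, so the result can only end up in one place. It assumes the older PyPDF2 method names used in the question (getNumPages, getPage, extractText); collect_text, FakePage and FakeReader are hypothetical names, and the fake classes exist only so the sketch runs without a real PDF.

```python
def collect_text(reader):
    """Return the text of every page in `reader` joined into one string."""
    parts = []
    for i in range(reader.getNumPages()):
        page = reader.getPage(i)
        # Collect each page's text; joining once at the end avoids
        # appending to the wrong variable inside the loop.
        parts.append(str(page.extractText()))
    return "\n".join(parts)


# Hypothetical stand-ins for a PDF reader, used only for demonstration.
class FakePage:
    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class FakeReader:
    def __init__(self, texts):
        self._pages = [FakePage(t) for t in texts]

    def getNumPages(self):
        return len(self._pages)

    def getPage(self, i):
        return self._pages[i]


full_text = collect_text(FakeReader(["first page", "second page"]))
print(len(full_text))  # non-zero once the pages have been collected
```

With a real file you would pass your own pdf_reader to collect_text instead of FakeReader.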
You can use the PDFMiner package to extract the text from PDF files. Here is some sample code (tested).
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def convert(fname, pages=None):
    if not pages:
        # To include all pages
        pagenums = set()
    else:
        # We can specify pages by giving an iterable of page numbers
        pagenums = set(pages)
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    # Input PDF
    infile = open(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    # Close all
    infile.close()
    converter.close()
    txt = output.getvalue()
    output.close()
    return txt

# Usage
text = convert('/home/stark/Desktop/file.pdf')  # includes all pages
# Page numbers are zero-based, so [1, 2, 3] selects the second to fourth pages
text = convert('/home/stark/Desktop/file.pdf', pages=[1, 2, 3])
print(text)
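Once the text is in a single string, it can be manipulated like any other Python string. As a rough sketch of the kind of check the question mentions (the length of the text), something like the following might help. summarize_text is a hypothetical helper, not part of PDFMiner or PyPDF2, and the sample string stands in for real extracted text.

```python
def summarize_text(text):
    """Return basic counts for an extracted text string."""
    lines = text.splitlines()
    words = text.split()  # splits on any run of whitespace
    return {
        "characters": len(text),
        "lines": len(lines),
        "words": len(words),
    }


# Stand-in for the string returned by an extraction function
sample = "Page one text\nPage two text"
stats = summarize_text(sample)
print(stats)
```

If "characters" comes back as 0 for a real PDF, the extraction step returned no text, which is worth checking before any further processing.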