
Python: problem saving a for loop as a variable

full_text = ''

i=0
while i<pdf_reader.getNumPages():
    pageinfo = pdf_reader.getPage(i)
    text += str(pageinfo.extractText())
    print(pageinfo.extractText())
    i = i + 1

I am attempting to extract all the text from a PDF file, and I am able to extract all of it with a loop. However, I want to save the loop's output in a variable for further manipulation. After I saved the loop as all_results, I couldn't do anything with it afterward. For instance, when I check the length of the text, the output is 0.

If I'm understanding the question correctly, you want to merge all the text into a single variable to be used after the loop is done.

Try this code:

all_pg_text = ''
all_results = 0
num_of_pages = pdf_reader.getNumPages()  # pdf_reader comes from the question's setup
for i in range(num_of_pages):
    print("Page Number: " + str(i))
    print("- - - - - - - - - - - - - - - - - - - -")
    pageObj = pdf_reader.getPage(i)
    pg_text = pageObj.extractText()
    print(pg_text)  # one page
    all_pg_text += pg_text  # add to full text
    print("- - - - - - - - - - - - - - - - - - - -")
    all_results += i
pdfFile.close()  # pdfFile is assumed to be the open file object behind pdf_reader

print(all_pg_text)
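
If you also want to keep each page's text separately, one option is to collect the pages in a list and join them once at the end. The snippet below is only a sketch: FakePage and FakeReader are hypothetical stand-ins that mimic the getNumPages(), getPage() and extractText() calls from the question, so it runs without a real PDF. With your own pdf_reader you would pass that object to collect_page_texts instead.

```python
class FakePage:
    """Hypothetical stand-in for a page object; only mimics extractText()."""

    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class FakeReader:
    """Hypothetical stand-in for the question's pdf_reader."""

    def __init__(self, texts):
        self._pages = [FakePage(t) for t in texts]

    def getNumPages(self):
        return len(self._pages)

    def getPage(self, i):
        return self._pages[i]


def collect_page_texts(reader):
    """Return a list holding the extracted text of every page, in order."""
    return [reader.getPage(i).extractText() for i in range(reader.getNumPages())]


pdf_reader = FakeReader(["first page", "second page"])
all_results = collect_page_texts(pdf_reader)   # one string per page
full_text = "\n".join(all_results)             # everything in a single string

print(len(all_results))  # 2
print(len(full_text))    # 22
```

Keeping the list around means you can still work page by page later, while full_text gives you the merged string.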

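Based on your updated question, this might work. Your loop appends to text, but the variable you initialize is full_text, so full_text never receives anything: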

full_text = ''

i=0  
while i < pdf_reader.getNumPages():
    pageinfo = pdf_reader.getPage(i)
    full_text += str(pageinfo.extractText())
    print(pageinfo.extractText())
    i = i + 1  

print(full_text)
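
To see the name mismatch in isolation, here is a small sketch with no PDF involved. It assumes that a variable called text already existed in your session (otherwise the original loop would have raised a NameError instead); the pages list is made-up sample data.

```python
pages = ["alpha", "beta"]  # made-up page texts, standing in for extractText() results

full_text = ''
text = ''  # assumed to exist already, e.g. left over from an earlier run

for page_text in pages:
    text += page_text  # appends to text, so full_text is never touched

print(len(full_text))  # 0
print(len(text))       # 9
```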

You can use the PDF Miner package to extract the text from PDF files. Here is some sample code (tested).

from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert(fname, pages=None):
    if not pages:
        # To include all pages
        pagenums = set()
    else:
        # We can specify pages by giving an iterable of pagenumbers
        pagenums = set(pages)

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    # Input PDF
    infile = open(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)

    # Close all
    infile.close()
    converter.close()
    txt = output.getvalue()
    output.close()
    return txt


# Usage
text = convert('/home/stark/Desktop/file.pdf')  # includes all pages
text = convert('/home/stark/Desktop/file.pdf', pages=[1, 2, 3])  # page indices are zero-based, so this selects the 2nd to 4th pages
print(text)
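
If you have the pdfminer.six fork installed (an assumption, since the answer does not say which PDF Miner distribution it used), its pdfminer.high_level module offers an extract_text helper that wraps the steps above. The wrapper name extract_all_text below is hypothetical, and this is a sketch rather than a drop-in replacement for convert().

```python
def extract_all_text(path, page_numbers=None):
    """Return the text of a PDF as one string via pdfminer.six's extract_text.

    page_numbers is an optional collection of zero-based page indices;
    None means every page.
    """
    # Imported here so that defining the function does not require the package.
    from pdfminer.high_level import extract_text

    return extract_text(path, page_numbers=page_numbers)


# Usage (requires pdfminer.six and a real file):
# text = extract_all_text('/home/stark/Desktop/file.pdf')
# print(len(text))
```

Either way you end up with an ordinary string, so len(), slicing and the other string methods work on it directly.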

