简体   繁体   中英

PyPDF2 hangs on processing

I'm processing multiple pdf files using PyPDF2 but my script hangs somewhere. All I can see in my console is some "startxref on same line as offset" which I'm correct is a warning so by right it should still go to the finally block and return an empty string.

Am I doing something wrong?

import PyPDF2
import sys
import os
def decode_pdf(src_filename):           
    out_str=""
    try:
        f = open(str(src_filename), "rb")           
        read_pdf = PyPDF2.PdfFileReader(f)
        number_of_pages = read_pdf.getNumPages()
        for i in range(0,number_of_pages):
            page = read_pdf.getPage(i)
            out_str = out_str + " " + page.extractText()
        out_str = ''.join(out_str.splitlines())
        f.close()
    except:
        print("Exception on pdf")
        print(sys.exc_info())
        out_str = ""
    finally:
        return out_str

I was facing this issue too and couldn't solve it using PyPDF2. I solved mine with pdfminer using the example from here

Copying the relevant code here below

from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close
    return text 

call the function convert() as below

convert('myfile.pdf', pages=[5,7])

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM