Using PDFminer as a library: “AttributeError: 'NoneType' object has no attribute 'getobj'”

Question

I am writing a script for uploading PDF files and parsing them in the process. For the parsing I use PDFMiner.

To turn the file into a PDFMiner document, I use the following function, which closely follows the instructions in the link above:

def load_document(self, _file = None):
    """turn the file into a PDFMiner document"""
    if _file == None:
        _file = self.options['file']

    parser = PDFParser(_file)
    doc = PDFDocument()
    doc.set_parser(parser)
    if self.options['password']:
        password = self.options['password']
    else:
        password = ""
    doc.initialize(password)
    if not doc.is_extractable:
        raise ValueError("PDF text extraction not allowed")

    return doc

The expected result is of course a nice PDFDocument instance, but instead I get an error:

Traceback (most recent call last):
  File "bzk_pdf.py", line 45, in <module>
    cli.run_cli(BZKPDFScraper)
  File "/home/toon/Projects/amcat/amcat/scripts/tools/cli.py", line 61, in run_cli
    instance = cls(options)
  File "/home/toon/Projects/amcat/amcat/scraping/pdf.py", line 44, in __init__
    self.doc = self.load_document()
  File "/home/toon/Projects/amcat/amcat/scraping/pdf.py", line 56, in load_document
    doc.set_parser(parser)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfparser.py", line 327, in set_parser
    self.info.append(dict_value(trailer['Info']))
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 132, in dict_value
    x = resolve1(x)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 60, in resolve1
    x = x.resolve()
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 49, in resolve
    return self.doc.getobj(self.objid)
AttributeError: 'NoneType' object has no attribute 'getobj'

I have no idea where to look, and I have not found anyone else with the same problem.

Some extra info that might help:

Here's my test file: http://www.2shared.com/document/kM_wrI3J/testpdf.html
_file is a Django File object, but using normal files gives the same result
pdfminer version: 'pdfminer-20110515'
Django: 1.4.3 (I don't think it matters)
Python 2.7.3

Answer 1

After some experimenting I found that I was missing a line:

parser.set_document(doc)

Having added that line, the function now works.

Looks like poor library design to me, but it might be that I've missed something and this just patches over the error.

Anyhow, I now have a PDF document with the data I need.

Here's the end result:

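from pdfminer.pdfparser import PDFParser, PDFDocument  # module layout of pdfminer-20110515

def load_document(self, _file=None):
    """Turn the file into a PDFMiner document."""
    if _file is None:
        _file = self.options['file']

    parser = PDFParser(_file)
    doc = PDFDocument()
    # Link parser and document in both directions; without set_document()
    # the call to set_parser() fails with the AttributeError shown above.
    parser.set_document(doc)
    doc.set_parser(parser)

    # Use an empty password when none is given (or when it is None).
    password = self.options.get('password') or ""
    doc.initialize(password)

    if not doc.is_extractable:
        raise ValueError("PDF text extraction not allowed")

    return doc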
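
A plausible reading of the traceback in the question is that the parser creates object references that remember the parser's document, and that this document is still None if parser.set_document(doc) was never called. The sketch below imitates that two-way link with made-up stand-in classes (ObjRef, Parser, Document and load are hypothetical names, not PDFMiner's real implementation), so it runs without PDFMiner installed:

```python
class ObjRef(object):
    """Stand-in for an indirect reference; remembers the document it was made for."""
    def __init__(self, doc, objid):
        self.doc = doc
        self.objid = objid

    def resolve(self):
        # Fails with AttributeError if self.doc is None.
        return self.doc.getobj(self.objid)


class Parser(object):
    """Stand-in parser; hands its current document to every reference it creates."""
    def __init__(self):
        self.doc = None

    def set_document(self, doc):
        self.doc = doc

    def read_trailer(self):
        return {'Info': ObjRef(self.doc, 1)}


class Document(object):
    """Stand-in document; resolves the trailer's Info entry as soon as a parser is set."""
    def __init__(self):
        self.objects = {1: {'Title': 'test'}}
        self.info = []

    def getobj(self, objid):
        return self.objects[objid]

    def set_parser(self, parser):
        trailer = parser.read_trailer()
        self.info.append(trailer['Info'].resolve())


def load(link_parser):
    """Build a document, optionally skipping the parser-to-document link."""
    parser = Parser()
    doc = Document()
    if link_parser:
        parser.set_document(doc)
    doc.set_parser(parser)
    return doc
```

In this toy model, load(link_parser=True) succeeds, while load(link_parser=False) raises the same kind of AttributeError as in the question, because the reference was created while the parser's document was still None.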

Answer 2

Try opening the file and sending it to the parser, like this:

with open(_file, 'rb') as f:
    parser = PDFParser(f)
    # your normal code here

The way you are doing it now, I suspect you are sending the filename as a string.
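
If it is unclear whether the caller passes a path or an already opened file, a small guard can normalise the input before it reaches the parser. This is only a sketch: open_pdf_source is a hypothetical helper, not part of PDFMiner, and it assumes that anything with read and seek methods can be used as a binary file.

```python
import io


def open_pdf_source(source):
    """Return a binary file-like object for a path or an open file (hypothetical helper)."""
    # Objects that already behave like files are passed through unchanged.
    if hasattr(source, 'read') and hasattr(source, 'seek'):
        return source
    # Anything else is treated as a path and opened in binary mode.
    return io.open(source, 'rb')
```

The result would then be handed to PDFParser in place of _file. The caller remains responsible for closing a file that the helper opened.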


solution1
2 2013-02-17 12:33:26

solution2
0 2013-02-17 09:28:31
