Python 3 parse PDF from web

Question

I was trying to get a PDF from a webpage, parse it, and print the result to the screen using PyPDF2. I got it working without issues with the following code:

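import PyPDF2
import requests

# buildurl, jornal, date and page come from the rest of the script (not shown).
with open("foo.pdf", "wb") as f:
    f.write(requests.get(buildurl(jornal, date, page)).content)
with open("foo.pdf", "rb") as pdf_file_obj:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)
    page_obj = pdf_reader.getPage(0)
    print(page_obj.extractText())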

Writing a file just so I could read it back sounded wasteful, so I figured I'd cut out the middleman with this:

pdf_reader = PyPDF2.PdfFileReader(requests.get(buildurl(jornal, date, page)).content)
page_obj = pdf_reader.getPage(0)
print(page_obj.extractText())

This, however, raises AttributeError: 'bytes' object has no attribute 'seek'. How can I feed the PDF coming from requests directly into PyPDF2?

Answer 1

PdfFileReader expects a file-like object that supports seek, not raw bytes, so you have to wrap the returned content in a file-like object using io.BytesIO:

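import io

import PyPDF2
import requests

# Wrap the downloaded bytes in an in-memory, seekable file-like object.
pdf_content = io.BytesIO(requests.get(buildurl(jornal, date, page)).content)
pdf_reader = PyPDF2.PdfFileReader(pdf_content)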
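
Why the wrapper helps can be checked with the standard library alone: raw bytes have no seek method, while io.BytesIO exposes the file-like methods a reader relies on. A minimal sketch, assuming a made-up placeholder payload in place of the real response content (it is not a valid PDF):

```python
import io

# Placeholder payload standing in for response.content (not a real PDF).
raw = b"%PDF-1.4 placeholder bytes"

# bytes is a plain immutable sequence: it has no file-like methods.
bytes_has_seek = hasattr(raw, "seek")

# BytesIO wraps the same data in an in-memory binary stream.
stream = io.BytesIO(raw)
stream_has_seek = hasattr(stream, "seek") and stream.seekable()

# The stream can be read, rewound, and read again, like an opened file.
header = stream.read(4)
stream.seek(0)
again = stream.read(4)

print(bytes_has_seek, stream_has_seek, header, again)
```

Nothing is written to disk here; the buffer lives in memory for as long as the stream object does.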

Answer 2

Use io.BytesIO to stand in for a file (Python 3):

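import io

# Write the downloaded bytes into an in-memory buffer.
output = io.BytesIO()
output.write(requests.get(buildurl(jornal, date, page)).content)
# Rewind to the start so the reader begins at byte 0.
output.seek(0)
pdf_reader = PyPDF2.PdfFileReader(output)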

I did not test this in your context, but this simple example worked:

import io

output = io.BytesIO()
output.write(bytes("hello world", "ascii"))
output.seek(0)
print(output.read())

yields:

b'hello world'
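
Two details of this approach are worth seeing side by side: without seek(0), the read starts at the end of what was just written and returns nothing, and passing the data straight to the BytesIO constructor leaves the position at 0, so no rewind is needed. A small sketch using the same "hello world" data (the variable names are only illustrative):

```python
import io

data = bytes("hello world", "ascii")

# Write without rewinding: the position sits at the end of the data.
no_rewind = io.BytesIO()
no_rewind.write(data)
read_without_seek = no_rewind.read()  # nothing left to read

# Write, then rewind before reading.
rewound = io.BytesIO()
rewound.write(data)
rewound.seek(0)
read_after_seek = rewound.read()

# Passing the data to the constructor starts at position 0 already.
direct = io.BytesIO(data)
start_position = direct.tell()
read_direct = direct.read()

print(read_without_seek, read_after_seek, read_direct, start_position)
```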

2 answers

solution1
5 ACCEPTED 2016-07-30 21:03:11

solution2
2 2016-07-30 21:00:22
