How to print a pdf file to stdout using python?

A correct pdf file has been created by a script (whose output can't be directly written to stdout, unfortunately). Say the file's name is 'myfile.pdf'.

I want to print the exact pdf content to stdout. (No processing in between).

To test this, I have written this short read_pdf.py script:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

with open('myfile.pdf', mode='rb') as pdf_file:
    for line in pdf_file:
        print(str(line))

I use the 'rb' mode because reading the file in text mode raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 10: invalid continuation byte. So binary mode looks like the only option.

Now of course the problem is that the output consists of b'blablabla' lines that cannot be used as a pdf file. To check it, I redirect read_pdf.py to a file and try to open it with a pdf viewer and of course it doesn't work:

$ ./read_pdf.py > test_output.pdf
$ evince test_output.pdf
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table

So, what is the right way to do it? I haven't looked at any dedicated pdf library because it doesn't seem necessary: I'd like to read and print the exact content without importing a pdf library for that.

chardet.detect(pdf_file.read()) couldn't help (it returned {'encoding': None, 'confidence': 0.0} ).

EDIT:
* I'm looking for a solution for Python 3 on a Linux/Unix system, not Windows.
* I need to know how to do this in Python because it's actually part of a bigger project written entirely in Python.

I think your problem is that you are reading the file line by line and printing each line, which adds an extra newline after every one. I tried the following and it works perfectly on OS X:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

with open('myfile.pdf', mode='rb') as pdf_file:
    print(pdf_file.read())

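For the sake of completeness: as noted by @zezollo, on Linux the output is still corrupted when print() is used, because in Python 3 print() writes the text representation of the bytes object (b'...') rather than the raw bytes. It is therefore necessary to write directly to the underlying binary buffer: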

import sys

with open('myfile.pdf', mode='rb') as pdf_file:
    sys.stdout.buffer.write(pdf_file.read())
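
To see why print() corrupts the output while a binary write does not, here is a small sketch that compares the two on a few made-up bytes. The payload and the in-memory streams (io.StringIO, io.BytesIO) are stand-ins chosen for illustration, not part of the original script:

```python
import io

# Made-up bytes standing in for real PDF content (an assumption for the demo).
payload = b"%PDF-1.4\n\xd0\xd4\r\n"

# In Python 3, print() converts its argument with str(), so a bytes object
# is written as its b'...' representation, followed by a newline.
text_out = io.StringIO()
print(payload, file=text_out)
printed = text_out.getvalue()

# A binary stream accepts the bytes as they are, with no conversion.
binary_out = io.BytesIO()
binary_out.write(payload)
written = binary_out.getvalue()

print(printed.startswith("b'"))  # the text form starts with the b' prefix
print(written == payload)        # the binary form is byte-for-byte identical
```

sys.stdout is a text stream like the StringIO above, and sys.stdout.buffer is its binary counterpart, which is why writing to the buffer keeps the pdf intact.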

The answer is actually to use sys.stdout.buffer.write() instead of print(), together with pdf_file.read():

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import sys

with open('myfile.pdf', mode='rb') as pdf_file:
    sys.stdout.buffer.write(pdf_file.read())
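
If the pdf may be large, reading it in one go with pdf_file.read() holds the whole file in memory. A possible variation, sketched here under the assumption that streaming is acceptable, copies it in chunks with shutil.copyfileobj. The function name dump_file and its out and chunk_size parameters are hypothetical, introduced only for this example:

```python
import shutil
import sys


def dump_file(path, out=None, chunk_size=64 * 1024):
    """Copy the file at `path` to a binary stream without decoding it."""
    if out is None:
        # sys.stdout is a text stream; .buffer is the binary layer beneath it.
        out = sys.stdout.buffer
    with open(path, mode="rb") as src:
        # copyfileobj reads and writes chunk_size bytes at a time.
        shutil.copyfileobj(src, out, chunk_size)
    out.flush()
```

Calling dump_file('myfile.pdf') from read_pdf.py and redirecting the script's output to a file should then give a byte-for-byte copy of the original.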
