
How to open web based PDF with pdf2txt in Python3

I'm successfully parsing local PDFs with pdfminer's pdf2txt in Python 3. I use the following command:

python3 pdf2txt.py -A -M 15.0 -L 0.3 -W 0.2 -F 0.5 -V -o output.txt -t text input.pdf

I was wondering whether I can pass a web link to a PDF instead of a local file. I'm not sure how to specify it. I tried wrapping the URL in quotes and in parentheses, but I get an error.

Python has urllib in the standard library. To retrieve the contents of a URL and save them to a local file, you can use urlretrieve:

# Python 2: urlretrieve lives in urllib, not urllib2
import urllib
urllib.urlretrieve('http://www.example.com/myfile.pdf', 'myfile_local.pdf')

In Python 3 this is tucked away slightly deeper, in the urllib.request module, as urllib.request.urlretrieve.
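For Python 3, the download step might look like the sketch below. The helper name download_pdf is made up for illustration, and the URL and file name are the same placeholders used above.

```python
import urllib.request


def download_pdf(url, dest_path):
    """Save the resource at `url` to `dest_path` and return the local path."""
    # urlretrieve returns a (filename, headers) tuple; only the filename is needed here.
    local_path, _headers = urllib.request.urlretrieve(url, dest_path)
    return local_path
```

Calling download_pdf('http://www.example.com/myfile.pdf', 'myfile_local.pdf') should leave a local copy that can then be passed to pdf2txt.py as input.pdf.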

I don't know what OS you're using, but you might also just use a program like wget from the command line; that way you don't have to write any Python code to do the retrieval.

Unfortunately pdf2txt.py doesn't support the parsing of streamed PDF documents. The internals require seeking within the file, which is difficult to achieve with a stream.

Your only option is to download the PDF document to your file system and then call pdf2txt.py on it. There are lots of tools for downloading URL resources, e.g. curl or wget, or you could write your own with Python.

You could easily make a shell, batch, or Python script to download the PDF file to a temporary file, run pdf2txt.py , and then clean up.
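A Python version of such a script could look roughly like this. It is only a sketch: the function name pdf_url_to_text and the pdf2txt_cmd parameter are hypothetical, the default command assumes pdf2txt.py is in the current directory and python3 is on the PATH, and only the -o and -t options from the question's command line are passed through.

```python
import os
import subprocess
import tempfile
import urllib.request


def pdf_url_to_text(url, output_path, pdf2txt_cmd=("python3", "pdf2txt.py")):
    """Download `url` to a temporary file, run pdf2txt.py on it, then clean up."""
    # Create a named temporary file and close the handle so other
    # processes (the download and pdf2txt.py) can open it by path.
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    try:
        urllib.request.urlretrieve(url, tmp_path)
        # check=True raises CalledProcessError if pdf2txt.py exits with an error.
        subprocess.run(
            [*pdf2txt_cmd, "-o", output_path, "-t", "text", tmp_path],
            check=True,
        )
    finally:
        # Remove the temporary PDF whether or not the conversion succeeded.
        os.remove(tmp_path)
    return output_path
```

The try/finally means the temporary PDF is deleted even if the download or the conversion fails. Any other pdf2txt.py options, such as the -A, -M, -L, -W, -F and -V flags from the question, could be appended to pdf2txt_cmd.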
