
PDF to TXT from HTTP request

I have a set of links to pdf files:

https://www.duo.uio.no/bitstream/10852/9012/1/oppgave-2003-10-30.pdf

Some of them are restricted, meaning I won't be able to access the pdf file, while others will go directly to the pdf file itself, like the link above.

I'm currently using the requests package (Python) to access the files, but there are far too many files for me to download, and I also don't want to keep the files as PDFs.

What I would like to do is go to each link, check if the link is a pdf file, download that file (if necessary), turn it into a txt file, and delete the original pdf file.

I have a shell script that is a very good pdf to txt converter, but is it possible to run a shell script from python?

Yes! It is entirely possible to run shell scripts from Python. Take a look at the subprocess module, which lets you start external processes much as you would from a shell: https://docs.python.org/2/library/subprocess.html

For example:

import subprocess

process = subprocess.Popen(["echo", "message"], stdout=subprocess.PIPE)

print(process.communicate())

There are many tutorials out there, e.g.: http://www.bogotobogo.com/python/python_subprocess_module.php
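To connect this to the converter script mentioned in the question, here is a minimal sketch of running an external converter on a downloaded PDF and deleting the PDF only if the conversion succeeded. The function name convert_and_remove and the script name ./pdf2txt.sh are hypothetical, and the sketch assumes the converter is called as "converter input.pdf output.txt"; adjust the argument order to whatever your script actually expects.

```python
import os
import subprocess


def convert_and_remove(pdf_path, converter_cmd):
    """Convert pdf_path to a .txt file next to it, then delete the PDF.

    converter_cmd is the command prefix as a list, e.g. ["./pdf2txt.sh"]
    (a hypothetical script name). It is assumed to be invoked as:
        <converter_cmd...> <input.pdf> <output.txt>
    """
    txt_path = os.path.splitext(pdf_path)[0] + ".txt"
    result = subprocess.run(
        converter_cmd + [pdf_path, txt_path],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the captured output as str
    )
    if result.returncode != 0:
        # Keep the PDF so the conversion can be retried or inspected.
        raise RuntimeError(
            "conversion of {} failed: {}".format(pdf_path, result.stderr.strip())
        )
    os.remove(pdf_path)
    return txt_path
```

With a script like the one described in the question, a call might look like convert_and_remove("oppgave-2003-10-30.pdf", ["./pdf2txt.sh"]). Passing the command as a list avoids shell quoting problems with unusual file names.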

Kieran Bristow has answered part of your question about how to run an external program from Python.

The other part of your question is about selectively downloading documents by checking whether the resource is a PDF document. Unless the remote server offers alternate representations of its documents (e.g. a text version), you will need to download the documents. To avoid downloading non-PDF documents, you can send an initial HEAD request and inspect the Content-Type response header, like this:

import os.path
import requests

session = requests.session()

for url in [
    'https://www.duo.uio.no/bitstream/10852/9012/1/oppgave-2003-10-30.pdf',
    'https://www.duo.uio.no/bitstream/10852abcd/90121023/1234/oppgave-2003-10-30.pdf']:
    try:
        # HEAD fetches only the headers, so non-PDF resources are never downloaded
        resp = session.head(url, allow_redirects=True)
        resp.raise_for_status()
        # Compare only the media type; the header may carry parameters such as a charset
        content_type = resp.headers.get('content-type', '')
        if content_type.split(';')[0].strip().lower() == 'application/pdf':
            resp = session.get(url)
            if resp.ok:
                filename = os.path.basename(url)
                with open(filename, 'wb') as outfile:
                    outfile.write(resp.content)
                print("Saved {} to file {}".format(url, filename))
            else:
                print('GET request for URL {} failed with HTTP status "{} {}"'.format(url, resp.status_code, resp.reason))
    except requests.RequestException as exc:
        # Covers HTTP error statuses as well as connection errors and timeouts
        print("Request failed for URL {} : {}".format(url, exc))
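Since the question mentions a large number of files, the same HEAD-then-GET idea can also be sketched as a pair of small helpers that stream the body to disk in chunks instead of holding each PDF in memory. The names is_pdf and fetch_pdf, the timeout default, and the chunk size are illustrative assumptions; session is expected to be a requests.Session (or anything with compatible head and get methods).

```python
import os


def is_pdf(headers):
    """Return True if the Content-Type header names application/pdf."""
    for name, value in headers.items():
        if name.lower() == "content-type":
            # Drop parameters such as "; charset=..." before comparing.
            return value.split(";", 1)[0].strip().lower() == "application/pdf"
    return False


def fetch_pdf(session, url, dest_dir, timeout=30):
    """Save url into dest_dir if it serves a PDF; return the path or None."""
    head = session.head(url, allow_redirects=True, timeout=timeout)
    if not head.ok or not is_pdf(head.headers):
        return None
    # Assumes the last URL path segment is a usable file name.
    path = os.path.join(dest_dir, os.path.basename(url))
    resp = session.get(url, stream=True, timeout=timeout)
    resp.raise_for_status()
    with open(path, "wb") as outfile:
        # Write the body chunk by chunk rather than loading it all at once.
        for chunk in resp.iter_content(chunk_size=65536):
            outfile.write(chunk)
    return path
```

A caller could loop over the links, call fetch_pdf(requests.session(), url, some_directory) for each one, and hand every non-None result to the converter. Restricted links that answer with an error status or an HTML login page are skipped by the HEAD check.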


 