简体   繁体   中英

Extracting links to pages in another PDF from PDF using Python or other method

I have 5 PDF files, each of which have links to different pages in another PDF file. The files are each tables of contents for large PDFs (~1000 pages each), making manual extraction possible, but very painful. So far I have tried to open the file in Acrobat Pro, and I can right click on each link and see what page it points to, but I need to extract all the links in some manner. I am not opposed to having to do a good amount of further parsing of the links, but I can't seem to pull them out by any means. I tried to export the PDF from Acrobat Pro as HTML or Word, but both methods didn't maintain the links.

I'm at my wits end, and any help would be great. I'm comfortable working with Python, or a range of other languages

Looking for URIs using pyPdf ,

import pyPdf

f = open('TMR-Issue6.pdf','rb')

pdf = pyPdf.PdfFileReader(f)
pgs = pdf.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'

for pg in range(pgs):

    p = pdf.getPage(pg)
    o = p.getObject()

    if o.has_key(key):
        ann = o[key]
        for a in ann:
            u = a.getObject()
            if u[ank].has_key(uri):
                print u[ank][uri]

gives,

http://www.augustsson.net/Darcs/Djinn/
http://plato.stanford.edu/entries/logic-intuitionistic/
http://citeseer.ist.psu.edu/ishihara98note.html

etc...

I couldn't find a file that had links to another pdf, but I suspect that the URI field should contain URIs of the form file:///myfiles

I've just made a small Python tool for exactly this, to list/download all referenced PDFs from a given PDF: https://www.metachris.com/pdfx/ (also: https://github.com/metachris/pdfx )

$ ./pdfx.py https://weakdh.org/imperfect-forward-secrecy.pdf -d ./
Reading url 'https://weakdh.org/imperfect-forward-secrecy.pdf'...
Saved pdf as './imperfect-forward-secrecy.pdf'
Document infos:
- CreationDate = D:20150821110623-04'00'
- Creator = LaTeX with hyperref package
- ModDate = D:20150821110805-04'00'
- PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
- Producer = pdfTeX-1.40.14
- Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
- Trapped = False
- Pages = 13

Analyzing text...
- URLs: 49
- URLs to PDFs: 17

JSON summary saved as './imperfect-forward-secrecy.pdf.infos.json'

Downloading 17 referenced pdfs...
Created directory './imperfect-forward-secrecy.pdf-referenced-pdfs'
Downloaded 'http://cr.yp.to/factorization/smoothparts-20040510.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/smoothparts-20040510.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35517.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35517.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35514.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35514.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35519.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35519.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35522.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35522.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35509.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35509.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35528.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35528.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35513.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35513.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35533.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35533.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35551.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35551.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35527.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35527.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35520.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35520.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35526.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35526.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35515.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35515.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35529.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35529.pdf'...
Downloaded 'http://cryptome.org/2013/08/spy-budget-fy13.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/spy-budget-fy13.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35671.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35671.pdf'...

The tool uses PyPDF2 , the de-facto Python standard library to read PDF content, a regular expression to match all urls , and starts a download thread for each PDF if you run it with the -d option (for --download-pdfs ).

如果您不能使用 Python,但有一种解压缩 PDF 内部对象流的方法,例如qpdf ,您可以 grep 获取 URI:

qpdf --qdf --object-streams=disable input.pdf - | grep -Poa '(?<=/URI \().*(?=\))'

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM