
Extracting links to pages in another PDF from PDF using Python or other method

I have 5 PDF files, each of which has links to different pages in another PDF file. The files are each tables of contents for large PDFs (~1000 pages each), which makes manual extraction possible, but very painful. So far I have tried opening the file in Acrobat Pro, where I can right-click each link and see what page it points to, but I need to extract all the links in some manner. I am not opposed to doing a good amount of further parsing of the links, but I can't seem to pull them out by any means. I tried to export the PDF from Acrobat Pro as HTML or Word, but neither method maintained the links.

I'm at my wits' end, and any help would be great. I'm comfortable working with Python or a range of other languages.

Looking for URIs using pyPdf,

import pyPdf

f = open('TMR-Issue6.pdf', 'rb')
pdf = pyPdf.PdfFileReader(f)
pgs = pdf.getNumPages()

key = '/Annots'   # per-page array of annotations
ank = '/A'        # action dictionary of a link annotation
uri = '/URI'      # target of a URI action

for pg in range(pgs):
    p = pdf.getPage(pg)
    o = p.getObject()

    if o.has_key(key):
        ann = o[key]
        for a in ann:
            u = a.getObject()
            # only link annotations that carry a URI action have '/URI'
            if u.has_key(ank) and u[ank].has_key(uri):
                print u[ank][uri]

gives,

http://www.augustsson.net/Darcs/Djinn/
http://plato.stanford.edu/entries/logic-intuitionistic/
http://citeseer.ist.psu.edu/ishihara98note.html

etc...

I couldn't find a file that had links to another PDF, but I suspect that the URI field would contain URIs of the form file:///myfiles
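For what it's worth, links that open a page in another PDF are usually stored not as URI actions but as remote go-to (/GoToR) actions, whose /F entry names the target file and whose /D entry holds the destination (often a page index plus a fit mode). Below is a rough sketch of how the loop above could be extended to print those as well; 'toc.pdf' is a placeholder filename, the key names come from the PDF specification, and the exact shape of /F and /D varies between PDF producers.

import pyPdf

# Sketch: report remote go-to (/GoToR) actions, which is how a link to a
# page in *another* PDF is normally encoded ('/F' names the target file,
# '/D' holds the destination, e.g. [page_number, '/Fit']).
# 'toc.pdf' is a placeholder filename.
pdf = pyPdf.PdfFileReader(open('toc.pdf', 'rb'))

for pg in range(pdf.getNumPages()):
    page = pdf.getPage(pg).getObject()
    if not page.has_key('/Annots'):
        continue
    for a in page['/Annots']:
        annot = a.getObject()
        if not annot.has_key('/A'):
            continue
        action = annot['/A'].getObject()
        if action.get('/S') == '/GoToR':
            print 'page', pg, '->', action.get('/F'), action.get('/D')
        elif action.get('/S') == '/URI':
            print 'page', pg, '->', action.get('/URI')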

I've just made a small Python tool for exactly this, to list/download all referenced PDFs from a given PDF: https://www.metachris.com/pdfx/ (also: https://github.com/metachris/pdfx)

$ ./pdfx.py https://weakdh.org/imperfect-forward-secrecy.pdf -d ./
Reading url 'https://weakdh.org/imperfect-forward-secrecy.pdf'...
Saved pdf as './imperfect-forward-secrecy.pdf'
Document infos:
- CreationDate = D:20150821110623-04'00'
- Creator = LaTeX with hyperref package
- ModDate = D:20150821110805-04'00'
- PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
- Producer = pdfTeX-1.40.14
- Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
- Trapped = False
- Pages = 13

Analyzing text...
- URLs: 49
- URLs to PDFs: 17

JSON summary saved as './imperfect-forward-secrecy.pdf.infos.json'

Downloading 17 referenced pdfs...
Created directory './imperfect-forward-secrecy.pdf-referenced-pdfs'
Downloaded 'http://cr.yp.to/factorization/smoothparts-20040510.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/smoothparts-20040510.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35517.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35517.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35514.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35514.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35519.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35519.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35522.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35522.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35509.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35509.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35528.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35528.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35513.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35513.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35533.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35533.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35551.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35551.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35527.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35527.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35520.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35520.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35526.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35526.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35515.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35515.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35529.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35529.pdf'...
Downloaded 'http://cryptome.org/2013/08/spy-budget-fy13.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/spy-budget-fy13.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35671.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35671.pdf'...

The tool uses PyPDF2, the de facto standard Python library for reading PDF content, plus a regular expression to match all URLs, and it starts a download thread for each referenced PDF if you run it with the -d option (short for --download-pdfs).
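If you only need the URLs rather than the full tool, the same idea can be sketched in a few lines: extract each page's text with PyPDF2 and run a URL regex over it. This is a rough sketch assuming the legacy PdfFileReader/extractText API and a deliberately simple pattern, so it will miss URLs that are split across lines or stored only in link annotations.

import re
import PyPDF2

# Rough sketch of the extract-text-then-regex approach, using the legacy
# PyPDF2 PdfFileReader/extractText API. The pattern is intentionally simple
# and will not catch every URL form.
URL_RE = re.compile(r'https?://[^\s<>"\')\]]+')

def find_urls(path):
    urls = set()
    with open(path, 'rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        for i in range(reader.getNumPages()):
            text = reader.getPage(i).extractText()
            urls.update(URL_RE.findall(text))
    return urls

if __name__ == '__main__':
    # 'imperfect-forward-secrecy.pdf' is just the example document from above
    for url in sorted(find_urls('imperfect-forward-secrecy.pdf')):
        marker = ' (PDF)' if url.lower().endswith('.pdf') else ''
        print(url + marker)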

If you can't use Python but have a way to unpack the PDF's internal object streams, e.g. qpdf, you can grep for the URIs:

qpdf --qdf --object-streams=disable input.pdf - | grep -Poa '(?<=/URI \().*(?=\))'
