I'm using PyPDF2 to process some PDF files. I'd like to extract outline/ToC data from files that have it, to get a sense of which section of the document a given page corresponds to.
According to the docs, PdfFileReader's getOutlines method should return a nested list of Destination objects. Then, according to the docs, each of these should have a page (int) attribute.
Unfortunately, this isn't the case with the files I've tried. Instead, I get IndirectObjects, which resolve to PyPDF2.generic.DictionaryObjects. I can't figure out how to get the Destination objects I'm expecting, or how to extract meaningful page numbers from the IndirectObjects I'm getting instead.
The ultimate goal is, given an outline entry's page number, to pass that page number to getPage() and then call extractText().
Any guidance much appreciated. Thank you!
PyPDF2.PdfFileReader has a getDestinationPageNumber method that gives you the page number from a Destination object.
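As a rough sketch of how that could fit together, assuming the legacy PyPDF2 1.x names mentioned above (PdfFileReader, getOutlines, getDestinationPageNumber) and that each Destination exposes a title attribute: the helper names flatten_outline, section_for_page and read_outline are made up for illustration and are not part of PyPDF2. The first walks the nested list, the second answers "which section is this page in?" by picking the last outline entry that starts at or before the page.

```python
from bisect import bisect_right


def flatten_outline(outline, page_number_of, depth=0):
    """Yield (depth, title, page_index) for each entry of a nested outline.

    `outline` is the nested list returned by getOutlines(): a sub-list holds
    the children of the entry just before it. `page_number_of` is any callable
    that maps an entry to a zero-based page index.
    """
    for entry in outline:
        if isinstance(entry, list):
            # A nested list means "children of the previous entry".
            yield from flatten_outline(entry, page_number_of, depth + 1)
        else:
            yield depth, entry.title, page_number_of(entry)


def section_for_page(entries, page_index):
    """Return the title of the last outline entry starting at or before page_index."""
    # Drop entries whose page could not be resolved (None or a negative index).
    resolved = [(p, t) for _, t, p in entries if p is not None and p >= 0]
    resolved.sort(key=lambda pair: pair[0])  # stable sort keeps outline order per page
    pos = bisect_right([p for p, _ in resolved], page_index)
    return resolved[pos - 1][1] if pos else None


def read_outline(path):
    """Hypothetical helper: flattened outline of a PDF via the PyPDF2 1.x API."""
    from PyPDF2 import PdfFileReader  # imported lazily; 1.x names are assumed

    with open(path, "rb") as fh:
        reader = PdfFileReader(fh)
        # Resolve page numbers while the file is still open.
        return list(
            flatten_outline(reader.getOutlines(), reader.getDestinationPageNumber)
        )
```

The page_index values produced this way are zero-based, so they should be usable directly with getPage(page_index) followed by extractText().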
However, PyPDF2 is not really updated anymore and outline iteration is broken on Python 3.7. Instead, you might want to try pikepdf, which also has outline support.
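A minimal pikepdf sketch, under a few assumptions: that Pdf.open_outline() yields an outline whose root is a list of items with title, destination and children attributes, and that wrapping the first element of an explicit destination array in pikepdf.Page gives access to a zero-based index. The function names below are invented for this example. Items that use a named destination or an action instead of an explicit destination are simply reported as None here.

```python
def walk_outline(items, depth=0):
    """Yield (depth, item) for every outline item, parents before children."""
    for item in items:
        yield depth, item
        yield from walk_outline(item.children, depth + 1)


def destination_page_index(item):
    """Best-effort zero-based page index for an explicit destination, else None."""
    import pikepdf  # imported lazily so walk_outline works without pikepdf

    dest = item.destination
    # An explicit destination is an array whose first element is the page object.
    if isinstance(dest, pikepdf.Array) and len(dest) > 0:
        return pikepdf.Page(dest[0]).index
    return None  # named destinations and actions are not resolved in this sketch


def print_outline(path):
    """Hypothetical helper: print each outline title with its page index."""
    import pikepdf

    with pikepdf.open(path) as pdf:
        with pdf.open_outline() as outline:
            for depth, item in walk_outline(outline.root):
                print("  " * depth + str(item.title), destination_page_index(item))
```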