
Resolving page numbers from PyPDF2 getOutlines()

I'm using PyPDF2 to process some PDF files. I'm hoping to extract outline/ToC data from files that have it, essentially to get a sense of which section of the document a given page belongs to.

According to the docs, PdfFileReader's getOutlines method should return a nested list of Destination objects. Also according to the docs, each of these should have a page (int) attribute.

Unfortunately, this isn't the case with the files I've tried. Instead, I get IndirectObjects, which resolve to PyPDF2.generic.DictionaryObjects. I can't figure out how to get the Destination objects I'm expecting, or how to extract meaningful page numbers from the IndirectObjects I'm getting instead.

The ultimate goal is, given an outline entry's page number, to pass that number to getPage() and then call extractText().

Any guidance much appreciated. Thank you!

PyPDF2.PdfFileReader has a getDestinationPageNumber method that gives you the page number from a Destination object.
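
As a rough sketch of how that could look with the legacy PyPDF2 API (before 3.0, where PdfFileReader, getOutlines and getDestinationPageNumber are available): the helper names flatten_outline and outline_pages below are made up for illustration, and the code assumes each non-list outline entry exposes a title attribute, as Destination objects do. The number returned by getDestinationPageNumber should be a zero-based page index, which is what getPage() expects.

```python
def flatten_outline(outlines, page_number_of, depth=0):
    """Yield (depth, title, page_index) for every entry in a nested outline.

    `outlines` is the nested list returned by getOutlines(); sub-lists hold
    the children of the entry that precedes them. `page_number_of` is any
    callable that maps an entry to a page index (hypothetical parameter name).
    """
    for item in outlines:
        if isinstance(item, list):
            # A nested list means one level deeper in the outline tree.
            yield from flatten_outline(item, page_number_of, depth + 1)
        else:
            yield depth, item.title, page_number_of(item)


def outline_pages(path):
    """Return the flattened outline of the PDF at `path` (legacy PyPDF2 API)."""
    # Imported here so flatten_outline can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileReader

    with open(path, "rb") as fh:
        reader = PdfFileReader(fh)
        return list(
            flatten_outline(reader.getOutlines(), reader.getDestinationPageNumber)
        )
```

With a reader still open, the page index from an entry can then be passed on, for example reader.getPage(page_index).extractText(), which matches the goal described in the question.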

However, PyPDF2 is not really maintained anymore, and outline iteration is broken on Python 3.7. You might want to try pikepdf instead, which also has outline support.
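
For comparison, a minimal sketch of walking an outline with pikepdf, assuming its open_outline() context manager and outline items that carry title and children attributes. The helper names walk_outline and list_outline are hypothetical. This only lists titles with their nesting depth; turning an item's destination into a page number is not shown here and depends on how the PDF stores its destinations.

```python
def walk_outline(items, depth=0):
    """Yield (depth, title) for each outline item, visiting children depth-first."""
    for item in items:
        yield depth, item.title
        yield from walk_outline(item.children, depth + 1)


def list_outline(path):
    """Return [(depth, title), ...] for the PDF at `path` using pikepdf."""
    # Imported here so walk_outline can be used without pikepdf installed.
    import pikepdf

    with pikepdf.open(path) as pdf:
        with pdf.open_outline() as outline:
            # Materialize the result before the PDF is closed.
            return list(walk_outline(outline.root))
```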
