
How to get byte range of PDF pages?

I'm trying to load a PDF document with Mozilla's pdf.js project. I've learned how to open the document at a given page and zoom level (#page=10&zoom=page-fit), and while looking through the viewer options I found that I could apparently also add range requests for the PDF file via the URL parameters. I don't know how this works, so I thought I'd ask here.

I have two PDF files. Can I add range parameters to the PDF URLs behind each button on my pages, so that clicking a button loads only the required page of the PDF?

I'm currently using XAMPP on my system, and I'm not sure if XAMPP supports range requests (to test), although the site will be uploaded online later. Are range requests commonly supported by webhosts?

How can I get the range in bytes, for all the pages in the 2 PDF files separately? Is there a PHP script or some Windows utility to get the page range (in bytes) from the PDF?

And once I have them, how can I add these range requests to the viewer.html page when the PDF is being loaded, so that only the required page is loaded first instead of the whole document? After that, disableAutoFetch=false could let the viewer:

get the remaining contents of the PDF if no other range requests are being sent for the PDF file

(read something like that on some blog while searching in incognito, don't remember the URL for that blog, but the pdf.js wiki doesn't mention this on the website).

EDIT: My PDF files are optimized, according to the pdfinfo utility.


The feature of requesting byte ranges is not meant for end users. It is an implicit requirement for the correct handling of 'linearized' PDFs (commonly also known as 'web optimized' PDFs).

You can check whether a PDF is linearized (web-optimized) with this command, for example:

 pdfinfo filename.pdf | grep Optimized:
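
If pdfinfo is not at hand, a rough equivalent can be sketched in Python. This is only a sketch: it relies on the PDF specification's rule that the linearization parameter dictionary sits within the first 1024 bytes of the file, the function name is made up for illustration, and it only detects the /Linearized key rather than validating the whole linearized layout.

```python
import re


def looks_linearized(head: bytes) -> bool:
    """Return True if the first 1024 bytes contain a /Linearized key."""
    # Only the start of the file matters: the linearization dictionary
    # has to appear within the first 1024 bytes.
    return re.search(rb"/Linearized\b", head[:1024]) is not None
```

It could be called as `looks_linearized(open("filename.pdf", "rb").read(1024))`.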

Linearized PDFs have a slightly different internal structure. They are arranged so that a conforming reader does not need to download the complete file just to reach the trailer and xref table (which in standard PDFs are always at the end of the file).

The trailer and the cross-reference (xref) table, a sort of internal PDF 'ToC', are needed so that the reader software can locate the root object within the file and, from there, the pages and all the other objects.

In a linearized PDF, the reader instead learns the xref and root object locations by other means (a linearization dictionary and hint tables near the start of the file), so it can start rendering the first page (whose objects must be at the beginning of the file) while the rest of the file, with its remaining objects and pages, is still downloading.
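
To illustrate why a standard PDF has to be read from the end: the file finishes with a `startxref` keyword, the byte offset of the xref table, and `%%EOF`. A minimal sketch of locating that offset, assuming the caller passes in the last kilobyte or so of the file (the function name is hypothetical):

```python
import re


def find_startxref(tail: bytes) -> int:
    """Return the xref offset announced by the last 'startxref' keyword."""
    # 'startxref' is followed by whitespace and the decimal byte offset.
    matches = re.findall(rb"startxref\s+(\d+)", tail)
    if not matches:
        raise ValueError("no startxref keyword found")
    # Incrementally updated files may contain several; the last one wins.
    return int(matches[-1])
```

For a tail such as `b"startxref\n18799\n%%EOF\n"` this returns 18799, which is where the reader would go next to read the xref table and trailer.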

This means a user can then click on bookmarks, internal hyperlinks, or tell the reader "go to page 80" as soon as the first page is visible. The reader then knows from its already processed information which byte range it should request from the conforming web server.
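
What the reader sends in that case is an ordinary HTTP `Range` header. A small sketch of building one from an object's offset and length; the helper name and the example numbers are made up, and in practice the viewer works these values out itself:

```python
def range_header(offset: int, length: int) -> dict:
    """Build a Range header covering `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends, hence the "- 1".
    return {"Range": f"bytes={offset}-{offset + length - 1}"}
```

For example, `range_header(0, 1024)` gives `{"Range": "bytes=0-1023"}`.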

Other questions:

  • No, in a 'standard' PDF the objects that belong to a certain page are almost never contiguous (that would be a very rare exception).

  • Yes, the web server needs to support byte range delivery ( 'byte serving' ) too. Yes, all modern web servers can be configured to support this.

  • No, I'm not aware of any utility that reports to you the page range (in bytes) from a PDF (it would work for linearized PDFs only, if so).
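
To find out whether a particular server (a local XAMPP install, for instance) does byte serving, one option is to request a single byte and look at the response. The sketch below assumes that a `206 Partial Content` status or an `Accept-Ranges: bytes` header indicates support; both function names are hypothetical.

```python
import urllib.request


def supports_byte_ranges(status: int, headers: dict) -> bool:
    """Interpret the response to a request that carried 'Range: bytes=0-0'."""
    if status == 206:  # Partial Content: the server honoured the range
        return True
    # Otherwise fall back to what the server advertises.
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("accept-ranges", "").strip().lower() == "bytes"


def probe(url: str) -> bool:
    """Send a one-byte range request to `url` and report the result."""
    req = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    with urllib.request.urlopen(req) as resp:
        return supports_byte_ranges(resp.status, dict(resp.headers.items()))
```

Calling `probe()` with the URL of one of your PDFs on the server under test would then report whether range requests are likely to work there.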

TL;DR: Asking for byte range downloads in the context of a PDF is only ever reasonable if your PDF document is 'web optimized' in the first place! (And requesting a certain byte range has to be done by the viewer, which translates a user's request for a certain page into the correct range numbers...)


Update

Resources:
