
How to get byte range of PDF pages?

I'm trying to load a PDF document with the Mozilla pdf.js project. I've learned how to open the document at a given page and zoom level ( #page=10&zoom=page-fit ), and when I checked the viewer's options I found that I could also add range requests to the PDF file via URL parameters. I don't know how this works, so I thought I'd ask here.

I have 2 PDF files. My question is: can I add range parameters to the PDF URLs behind each of the buttons on my pages, so that clicking a button loads only the required page of the PDF?

I'm currently using XAMPP on my system, and I'm not sure whether XAMPP supports range requests (for testing), although the site will be uploaded online later. Are range requests commonly supported by web hosts?

How can I get the byte range of each page in the 2 PDF files separately? Is there a PHP script or a Windows utility that reports the page ranges (in bytes) of a PDF?

And once I have them, how can I add these range requests to the viewer.html page when the PDF is being loaded, so that only the required page is loaded first instead of the whole document, after which disableAutoFetch=false could let the viewer fetch the rest of the PDF:

get the remaining contents of the PDF if no other range requests are being sent for the PDF file

(I read something like that on a blog while searching in incognito mode and don't remember the blog's URL, but the pdf.js wiki doesn't mention this.)

EDIT: My PDF files are optimized, according to the pdfinfo utility.

[Image: PDF optimization]

The feature of requesting byte ranges is not meant for end users. It is an implicit requirement for the correct handling of 'linearized' PDFs (also commonly known as 'web optimized' PDFs).

You can check whether a PDF is linearized/web-optimized with this command, for example:

 pdfinfo filename.pdf | grep Optimized:
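
If pdfinfo is not available, a rough equivalent can be sketched in Python. This is only a heuristic, resting on the rule that a linearized file carries its linearization parameter dictionary (with a /Linearized key) within the first 1024 bytes; the function names are made up for this illustration, and a file that was modified after linearization may still pass the check.

```python
def is_linearized(header: bytes) -> bool:
    """Heuristic: look for a /Linearized key in the first 1024 bytes."""
    return b"/Linearized" in header[:1024]


def is_linearized_file(path: str) -> bool:
    """Apply the heuristic to a file on disk (hypothetical helper)."""
    with open(path, "rb") as fh:
        return is_linearized(fh.read(1024))
```

For example, is_linearized_file("filename.pdf") should agree with the Optimized: line printed by pdfinfo for ordinary files.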

Linearized PDFs have a slightly different internal structure. Basically, they are built so that conforming reader software does not need to download the complete file before it can access the trailer and xref table (which in standard PDFs are always at the end of the file).

The trailer and the cross-reference (xref) table, a sort of internal PDF 'ToC', are needed so that the reader software can identify the location of the root object within the file and, from there, the pages and all the other objects.
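
To make the "end of the file" point concrete, here is a minimal Python sketch of how a reader could locate the xref table in a standard PDF: it reads the tail of the file and parses the number after the last startxref keyword, which is the byte offset of the cross-reference section. The helper names and the 1024-byte tail size are assumptions for illustration, not how any particular viewer does it.

```python
import re


def find_startxref(tail: bytes) -> int:
    """Return the byte offset that follows the last 'startxref' keyword."""
    matches = re.findall(rb"startxref\s+(\d+)", tail)
    if not matches:
        raise ValueError("no startxref keyword found in the given bytes")
    # The last occurrence belongs to the most recent trailer.
    return int(matches[-1])


def read_tail(path: str, size: int = 1024) -> bytes:
    """Read the last `size` bytes of a file (hypothetical helper)."""
    with open(path, "rb") as fh:
        fh.seek(0, 2)  # jump to the end to learn the file length
        length = fh.tell()
        fh.seek(max(0, length - size))
        return fh.read()
```

Calling find_startxref(read_tail("filename.pdf")) needs the end of the file to be available, which is exactly what a non-linearized PDF forces a viewer to wait for or to request separately.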

Instead, the reader of a linearized PDF is told about the xref and root object locations by different means, and it can start rendering the first page (whose objects need to be at the beginning of the file) while the rest of the file, objects, and pages are still downloading.

This means a user can click on bookmarks or internal hyperlinks, or tell the reader "go to page 80", as soon as the first page is visible. The reader then knows from the information it has already processed which byte range it should request from the conforming web server.

Other questions:

  • No, in a 'standard' PDF the objects that belong to a certain page are almost never contiguous (that would be a very rare exception).

  • Yes, the web server needs to support byte range delivery ('byte serving') too. Yes, all modern web servers can be configured to support this.

  • No, I'm not aware of any utility that reports the page range (in bytes) of a PDF (if one existed, it would work for linearized PDFs only).
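
On the byte serving point, the check a client makes can be sketched in a few lines of Python. The sketch assumes the usual HTTP conventions (a Range: bytes=first-last request header, a 206 Partial Content reply with Content-Range, or an Accept-Ranges: bytes header on a normal reply); the function names are invented for this example and no network call is made.

```python
def range_header(first: int, last: int) -> dict:
    """Build the request header asking for bytes first..last (inclusive)."""
    if first < 0 or last < first:
        raise ValueError("invalid byte range")
    return {"Range": f"bytes={first}-{last}"}


def server_honours_ranges(status: int, headers: dict) -> bool:
    """Decide from a response whether the server supports byte serving."""
    lowered = {name.lower(): value for name, value in headers.items()}
    # A ranged request that worked comes back as 206 with Content-Range.
    if status == 206 and "content-range" in lowered:
        return True
    # Otherwise the server may advertise support on an ordinary response.
    return lowered.get("accept-ranges", "").strip().lower() == "bytes"
```

To test a local XAMPP setup, one could send a request carrying range_header(0, 1023) with any HTTP client and pass the resulting status code and headers to server_honours_ranges.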

TL;DR: Asking for byte range downloads in the context of a PDF is only ever reasonable if your PDF document is 'web optimized' in the first place! (And requesting a certain byte range has to be done by the viewer, which translates a user's request for a certain page into the correct range numbers...)


Update

Resources:
