
pdfjs can't view PDF when containing non-standard characters

I am trying to view a PDF using PDFJS. I have the following code, which works fine for a demo PDF I got from the PDFJS website, but it doesn't work for other PDFs I have tried. Here is the raw text of the demo PDF that works:

%PDF-1.7
1 0 obj  % entry point
<</Type/Catalog/Pages 2 0 R>>
endobj
2 0 obj<</Type/Pages/MediaBox[ 0 0 200 200]/Count 1/Kids[3 0 R]>>endobj
3 0 obj<</Type/Page/Parent 2 0 R/Resources<</Font<</F1 4 0 R>>>>/Contents 5 0 R>>endobj
4 0 obj<</Type/Font/Subtype/Type1/BaseFont/Times-Roman>>endobj
5 0 obj  % page content
<</Length 44>> stream
BT 70 50 TD /F1 12 Tf(Hello, world!) Tj ET
endstream endobj
xref trailer <</Size 6/Root 1 0 R>> startxref
%%EOF

And here is my HTML code that successfully loads the above PDF:

<script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.2.228/pdf.js"></script>
<input type="file" id="input"/> </br> <canvas id="can" width=1000 height=1000/>
<script>
    document.getElementById('input').addEventListener('change', function(e){
        var reader = new FileReader()
        reader.onload = function(x){
            window['pdfjs-dist/build/pdf'].getDocument({data:x.target.result}).promise.then(function(pdf){
                pdf.getPage(1).then(function(page){
                    page.render({canvasContext:document.getElementById('can').getContext('2d'),
                        viewport:page.getViewport({scale:1})})
        })})}
        reader.readAsText(e.target.files[0])
    }, false)
</script>

However, other PDFs of mine won't load at all. For example, I generated a one-page PDF containing only the word 'TEST' on Overleaf and downloaded it. When I tried uploading this PDF to my HTML page, I got these errors in the console:

Warning: Invalid stream: "FormatError: Bad FCHECK in flate stream: 120, 253"
util.js:306 Warning: Indexing all PDF objects
2util.js:306 Warning: Invalid stream: "FormatError: Bad FCHECK in flate stream: 120, 253"
viewPDF.html:1 Uncaught (in promise) InvalidPDFException {name: "InvalidPDFException", message: "Invalid PDF structure"}
Promise.then (async)
reader.onload @ viewPDF.html:7
load (async)
(anonymous) @ viewPDF.html:6

I suspect the problem is related to the fact that the PDFs that aren't working contain non-standard characters. Here are the first few lines of the PDF from Overleaf:

%PDF-1.5
%���
3 0 obj
<< /Linearized 1 /L 11602 /H [ 678 125 ] /O 7 /E 11072 /N 1 /T 11321 >>
endobj

4 0 obj
<< /Type /XRef /Length 51 /Filter /FlateDecode /DecodeParms << /Columns 4 /Predictor 12 >> /W [ 1 2 1 ] /Index [ 3 14 ] /Info 1 0 R /Root 5 0 R /Size 17 /Prev 11322                 /ID [<8f1689fb6a16051fd66ebeadaa364b8d><4a8030207ba6597007a967ed52a9309d>] >>
stream
x�cbd�g`b`8 $��XF@���*��    ��@�Y�����v�#�.
endstream
endobj

5 0 obj
<< /Pages 14 0 R /Type /Catalog >>
endobj
6 0 obj
<< /Filter /FlateDecode /S 36 /Length 48 >>
stream
x�c```e``Z��
            pe31
                B�����,��v�>aW�

You're outputting encoded binary streams, as shown by those symbols, and as a PDF becomes more complex they are needed more and more for math fonts, images and ordinary embedded fonts. It is possible to output them as ASCII instead, and that is acceptable as long as all the outputs are indexed. Your Overleaf PDF is further complicated by being output for the web (/Linearized).
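To illustrate why those symbols matter once a file passes through a text decoder, here is a minimal JavaScript sketch. It rests on three assumptions. The stream is assumed to begin with the common zlib header bytes 0x78 0xDA, because the real second byte cannot be recovered from the listing above. The file is assumed to be decoded as UTF-8. The string is assumed to be turned back into bytes by keeping the low 8 bits of each character code. The helper names are hypothetical.

```javascript
// Hypothetical helper: what a UTF-8 text read does to raw bytes, followed by
// a low-8-bits conversion back to bytes (an assumption about how string
// input is treated downstream).
function roundTripThroughText(bytes) {
  // Byte sequences that are not valid UTF-8 are replaced with U+FFFD.
  const text = new TextDecoder("utf-8").decode(bytes);
  return Uint8Array.from(text, (ch) => ch.charCodeAt(0) & 0xff);
}

// zlib header check: CMF * 256 + FLG must be a multiple of 31.
function hasValidZlibHeader(bytes) {
  return ((bytes[0] << 8) | bytes[1]) % 31 === 0;
}

// Assumed stream start: 0x78 0xDA, then "cbd" as in the listing above.
const original = Uint8Array.from([0x78, 0xda, 0x63, 0x62, 0x64]);
const damaged = roundTripThroughText(original);

console.log(hasValidZlibHeader(original));    // true
console.log(Array.from(damaged.slice(0, 2))); // [ 120, 253 ]
console.log(hasValidZlibHeader(damaged));     // false
```

Under those assumptions the damaged header pair is 120, 253, the same two numbers that appear in the "Bad FCHECK" warning. One way to avoid the decoding step is to read the file with `reader.readAsArrayBuffer(...)` and pass `new Uint8Array(x.target.result)` as `data`. This is a suggestion and is not part of the answer above.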

The structure of a PDF is not simple, and your minimal working example should look more like this, with an xref table included.

%PDF-1.7
%µ¶

1 0 obj
<</Type/Catalog/Pages 2 0 R>>
endobj

2 0 obj
<</Type/Pages/MediaBox[0 0 200 200]/Count 1/Kids[3 0 R]>>
endobj

3 0 obj
<</Type/Page/Parent 2 0 R/Resources<</Font<</F1 4 0 R>>>>/Contents 5 0 R>>
endobj

4 0 obj
<</Type/Font/Subtype/Type1/BaseFont/Times-Roman>>
endobj

5 0 obj
<</Length 63>>
stream
q
BT
-50 TL
/F1 12 Tf
1 0 0 1 70 50 Tm
(Hello, world!) Tj
ET
Q

endstream
endobj

xref
0 6
0000000000 65535 f 
0000000016 00000 n 
0000000062 00000 n 
0000000136 00000 n 
0000000227 00000 n 
0000000293 00000 n 

trailer
<</Size 6/Root 1 0 R>>
startxref
405
%%EOF

The main problem with this format is that the decimal byte offsets need to be correct, so different OS line endings (\n, \r\n or \r) in a large file can alter those values drastically; if a single byte is wrong, the file is corrupted.
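The sensitivity to line endings can be sketched in JavaScript. The helper below is hypothetical and makes three assumptions: the file is saved as UTF-8, every object header has the form `N G obj` at the start of a line, and the objects appear in numeric order starting at 1.

```javascript
// Hypothetical helper: recompute classic xref entries from the text of a PDF.
function xrefEntries(body) {
  const encoder = new TextEncoder();          // offsets are byte counts, not character indexes
  const header = /^(\d+) (\d+) obj\b/gm;      // "N G obj" at the start of a line
  const lines = ["0000000000 65535 f "];      // object 0 is always the free-list head
  let match;
  while ((match = header.exec(body)) !== null) {
    // Byte offset of the header = encoded length of everything before it.
    const offset = encoder.encode(body.slice(0, match.index)).length;
    const generation = match[2].padStart(5, "0");
    lines.push(`${String(offset).padStart(10, "0")} ${generation} n `);
  }
  return lines;
}

const lf = "%PDF-1.7\n1 0 obj\n<<>>\nendobj\n2 0 obj\n<<>>\nendobj\n";
const crlf = lf.replace(/\n/g, "\r\n");

console.log(xrefEntries(lf).slice(1));
// [ '0000000009 00000 n ', '0000000029 00000 n ' ]
console.log(xrefEntries(crlf).slice(1));
// [ '0000000010 00000 n ', '0000000033 00000 n ' ]
```

The same two objects land at different offsets once every line ending grows by one byte, so a table written for one convention no longer matches a file saved with the other.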
