
PDF.js can't display a PDF that contains non-standard characters

I am trying to view a PDF using PDFJS. I have the following code which works fine for a demo PDF I got from the PDFJS website, however it doesn't work for other PDFs I have tried. Here is the raw text of the demo PDF that works:

%PDF-1.7
1 0 obj  % entry point
<</Type/Catalog/Pages 2 0 R>>
endobj
2 0 obj<</Type/Pages/MediaBox[ 0 0 200 200]/Count 1/Kids[3 0 R]>>endobj
3 0 obj<</Type/Page/Parent 2 0 R/Resources<</Font<</F1 4 0 R>>>>/Contents 5 0 R>>endobj
4 0 obj<</Type/Font/Subtype/Type1/BaseFont/Times-Roman>>endobj
5 0 obj  % page content
<</Length 44>> stream
BT 70 50 TD /F1 12 Tf(Hello, world!) Tj ET
endstream endobj
xref trailer <</Size 6/Root 1 0 R>> startxref
%%EOF

And here is my html code that successfully loads the above PDF:

<script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.2.228/pdf.js"></script>
<input type="file" id="input"/> <br> <canvas id="can" width="1000" height="1000"></canvas>
<script>
    document.getElementById('input').addEventListener('change', function(e){
        var reader = new FileReader()
        reader.onload = function(x){
            window['pdfjs-dist/build/pdf'].getDocument({data:x.target.result}).promise.then(function(pdf){
                pdf.getPage(1).then(function(page){
                    page.render({canvasContext:document.getElementById('can').getContext('2d'),
                        viewport:page.getViewport({scale:1})})
        })})}
        reader.readAsText(e.target.files[0])
    }, false)
</script>

However, other PDFs of mine won't load at all. For example, I generated a one-page PDF containing only the word 'TEST' on Overleaf and downloaded it. When I uploaded this PDF to my HTML page, I got these errors in the console:

Warning: Invalid stream: "FormatError: Bad FCHECK in flate stream: 120, 253"
util.js:306 Warning: Indexing all PDF objects
2util.js:306 Warning: Invalid stream: "FormatError: Bad FCHECK in flate stream: 120, 253"
viewPDF.html:1 Uncaught (in promise) InvalidPDFException {name: "InvalidPDFException", message: "Invalid PDF structure"}
Promise.then (async)
reader.onload @ viewPDF.html:7
load (async)
(anonymous) @ viewPDF.html:6

I suspect the problem is related to the fact that the PDFs that don't work contain non-standard characters. Here are the first few lines of the PDF from Overleaf:

%PDF-1.5
%���
3 0 obj
<< /Linearized 1 /L 11602 /H [ 678 125 ] /O 7 /E 11072 /N 1 /T 11321 >>
endobj

4 0 obj
<< /Type /XRef /Length 51 /Filter /FlateDecode /DecodeParms << /Columns 4 /Predictor 12 >> /W [ 1 2 1 ] /Index [ 3 14 ] /Info 1 0 R /Root 5 0 R /Size 17 /Prev 11322                 /ID [<8f1689fb6a16051fd66ebeadaa364b8d><4a8030207ba6597007a967ed52a9309d>] >>
stream
x�cbd�g`b`8 $��XF@���*��    ��@�Y�����v�#�.
endstream
endobj

5 0 obj
<< /Pages 14 0 R /Type /Catalog >>
endobj
6 0 obj
<< /Filter /FlateDecode /S 36 /Length 48 >>
stream
x�c```e``Z��
            pe31
                B�����,��v�>aW�

You're outputting encoded binary streams, as those symbols show, and as a PDF becomes more complex they are needed more and more for math fonts, images, and ordinary embedded fonts. It is possible to write the streams out in ASCII and still have an acceptable file, as long as all the objects are indexed. Your Overleaf file is further complicated by being output for the web as /Linearized.
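
To make the binary-stream point concrete, here is a small sketch (not PDF.js code) of what may be happening to those bytes. The two numbers in the `Bad FCHECK in flate stream: 120, 253` warning look like the first two bytes of a compressed stream, and 253 is what you would get if the second byte had been decoded as text, replaced by U+FFFD, and then cut back down to 8 bits. The sample bytes and the `lowBytes` conversion are assumptions chosen for illustration; I have not traced the exact path inside the library.

```javascript
// Sketch: how decoding binary PDF bytes as text can damage a Flate stream header.
// The sample bytes and the low-8-bits conversion are assumptions for illustration.

// zlib header check: the CMF/FLG pair, read as a 16-bit number, must be a multiple of 31.
function fcheckOk(cmf, flg) {
  return ((cmf << 8) + flg) % 31 === 0;
}

// Roughly what reading a file as text does: decode the raw bytes as UTF-8.
// Bytes that are not valid UTF-8 become the replacement character U+FFFD.
function decodeAsText(bytes) {
  return new TextDecoder('utf-8').decode(bytes);
}

// Assumed string-to-bytes step: keep only the low 8 bits of each UTF-16 code unit.
function lowBytes(str) {
  const out = new Uint8Array(str.length);
  for (let i = 0; i < str.length; i++) {
    out[i] = str.charCodeAt(i) & 0xff;
  }
  return out;
}

// Hypothetical start of a compressed stream: a common zlib header plus one data byte.
const original = new Uint8Array([0x78, 0x9c, 0x63]);
const damaged = lowBytes(decodeAsText(original));

console.log(Array.from(original), fcheckOk(original[0], original[1]));
console.log(Array.from(damaged), fcheckOk(damaged[0], damaged[1]));
```

In this sketch the original pair 120, 156 passes the check, while the round-tripped pair comes out as 120, 253 and fails it. If that is what is happening in your page, the thing to try is reading the file as bytes instead of text: call `reader.readAsArrayBuffer(e.target.files[0])` and pass `new Uint8Array(x.target.result)` as `data` to `getDocument`.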

The structure of a PDF is not simple, and your minimal working example should look more like this, with an xref table included:

%PDF-1.7
%µ¶

1 0 obj
<</Type/Catalog/Pages 2 0 R>>
endobj

2 0 obj
<</Type/Pages/MediaBox[0 0 200 200]/Count 1/Kids[3 0 R]>>
endobj

3 0 obj
<</Type/Page/Parent 2 0 R/Resources<</Font<</F1 4 0 R>>>>/Contents 5 0 R>>
endobj

4 0 obj
<</Type/Font/Subtype/Type1/BaseFont/Times-Roman>>
endobj

5 0 obj
<</Length 63>>
stream
q
BT
-50 TL
/F1 12 Tf
1 0 0 1 70 50 Tm
(Hello, world!) Tj
ET
Q

endstream
endobj

xref
0 6
0000000000 65535 f 
0000000016 00000 n 
0000000062 00000 n 
0000000136 00000 n 
0000000227 00000 n 
0000000293 00000 n 

trailer
<</Size 6/Root 1 0 R>>
startxref
405
%%EOF

The main problem with this format is that the decimal byte offsets need to be correct, so different OS line endings (\n, \r\n, or \r) can shift those values drastically in a large file; get one byte wrong and the file is corrupted.
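
As a rough illustration of how the line ending moves those offsets, the sketch below counts bytes up to each `N 0 obj` header for the first few lines of the example above. The helper names `objectOffsets` and `xrefEntry` are made up for this answer, and the sketch assumes the `%µ¶` comment is stored as UTF-8 (two bytes per character), which is the assumption under which the offsets 16 and 62 in the table above come out.

```javascript
// Sketch: byte offsets of "N G obj" headers under different line endings.
// Helper names are hypothetical; the header comment is assumed to be UTF-8.

// Walk the lines, tracking the running byte position of each line start.
function objectOffsets(lines, eol) {
  const encoder = new TextEncoder(); // always encodes to UTF-8
  const eolLength = encoder.encode(eol).length;
  const offsets = {};
  let pos = 0;
  for (const line of lines) {
    const header = /^(\d+) \d+ obj$/.exec(line);
    if (header) {
      offsets[header[1]] = pos; // keyed by object number
    }
    pos += encoder.encode(line).length + eolLength;
  }
  return offsets;
}

// Format one in-use xref entry: 10-digit offset, 5-digit generation, "n", trailing space.
function xrefEntry(offset) {
  return String(offset).padStart(10, '0') + ' 00000 n ';
}

// First lines of the example file; \u00b5\u00b6 is the "µ¶" comment.
const lines = [
  '%PDF-1.7',
  '%\u00b5\u00b6',
  '',
  '1 0 obj',
  '<</Type/Catalog/Pages 2 0 R>>',
  'endobj',
  '',
  '2 0 obj',
];

const lf = objectOffsets(lines, '\n');
const crlf = objectOffsets(lines, '\r\n');

console.log('LF  :', xrefEntry(lf[1]), xrefEntry(lf[2]));
console.log('CRLF:', xrefEntry(crlf[1]), xrefEntry(crlf[2]));
```

With \n endings the two objects start at bytes 16 and 62, matching the table; with \r\n the same text puts them at 19 and 69, so a table written for one line-ending convention no longer points at the objects after a conversion to the other.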
