Text to an Array Buffer causes files to be corrupted

Question

I have a sample in which the user can select a file (PDF files in particular), convert that file to an array buffer, reconstruct the file from that array buffer, and download it. This works as expected.

<input type="file" id="file_input" class="foo" />
<div id="output_field" class="foo"></div>


$(document).ready(function(){
    $('#file_input').on('change', function(e){
        readFile(this.files[0], function(e) {
            //manipulate with result...
            $('#output_field').text(e.target.result);
            try {
                // Rebuild the file from the ArrayBuffer and trigger a download.
                var file = new Blob([e.target.result], { type: 'application/pdf' });
                var fileURL = window.URL.createObjectURL(file);
                var seconds = new Date().getTime() / 1000;
                var fileName = "cert" + parseInt(seconds) + ".pdf";
                var a = document.createElement("a");
                document.body.appendChild(a);
                a.style.display = "none";
                a.href = fileURL;
                a.download = fileName;
                a.click();
            }
            catch (err) {
                $('#output_field').text(err);
            }
        });     
    });
});

function readFile(file, callback){
    var reader = new FileReader();
    reader.onload = callback
    reader.readAsArrayBuffer(file);
}

Now let's say I used reader.readAsText(file); instead of reader.readAsArrayBuffer(file);. In that case I would convert the text to an array buffer and try to do the same thing.

$(document).ready(function(){
    $('#file_input').on('change', function(e){
        readFile(this.files[0], function(e) {
            //manipulate with result...
            try {
                // Copy each UTF-16 code unit of the decoded text into a Uint16Array.
                var buf = new ArrayBuffer(e.target.result.length * 2);
                var bufView = new Uint16Array(buf);
                for (var i = 0, strLen = e.target.result.length; i < strLen; i++) {
                    bufView[i] = e.target.result.charCodeAt(i);
                }

                var file = new Blob([bufView], { type: 'application/pdf' });
                var fileURL = window.URL.createObjectURL(file);
                var seconds = new Date().getTime() / 1000;
                var fileName = "cert" + parseInt(seconds) + ".pdf";
                var a = document.createElement("a");
                document.body.appendChild(a);
                a.style.display = "none";
                a.href = fileURL;
                a.download = fileName;
                a.click();
            }
            catch (err) {
                $('#output_field').text(err);
            }
        });

    });
});

function readFile(file, callback){
    var reader = new FileReader();
    reader.onload = callback
    reader.readAsText(file);
}

Now if I pass a PDF file that is small and contains only text, this works fine, but when I select files that are large and/or contain images, a corrupted file is downloaded.

Now I do know that I'm trying to make life harder for myself. But what I'm trying to do is somehow convert the result from readAsText() into an ArrayBuffer so that readAsText() and readAsArrayBuffer() work identically.

Answer 1

The readAsText method doesn't simply make the bytes accessible as a UTF-16 string. Instead, it decodes them as text, according to a given text encoding, UTF-8 by default. This will mangle any binary data that you are trying to read. As you already figured out, use readAsArrayBuffer for that.

You can try to use a TextEncoder to encode your text back to a typed array, but that's not guaranteed to yield the same result: a BOM gets stripped, invalid UTF-8 sequences lead to decoding errors (they are replaced with U+FFFD), and if you're unlucky then even Unicode normalisation will happen.

It might get easier if you explicitly specify a single-byte decoding, but really you should just use readAsArrayBuffer.
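
To see the lossy round trip concretely, here is a minimal sketch using the standard TextDecoder and TextEncoder. The helper names roundTripUtf8 and sameBytes are made up for illustration, and the decoding step only approximates what readAsText does with its default UTF-8 encoding.

```javascript
// Hypothetical helper: decode bytes as UTF-8 text, then encode the text back.
function roundTripUtf8(bytes) {
  const text = new TextDecoder('utf-8').decode(bytes); // lossy for non-UTF-8 input
  return new TextEncoder().encode(text);               // always emits UTF-8
}

// Hypothetical helper: true when both byte arrays have identical contents.
function sameBytes(a, b) {
  return a.length === b.length && a.every((v, i) => v === b[i]);
}

const ascii = Uint8Array.from([0x25, 0x50, 0x44, 0x46]);   // "%PDF", plain ASCII
const binary = Uint8Array.from([0xFF, 0xD8, 0xFF, 0xE0]);  // not valid UTF-8
const withBom = Uint8Array.from([0xEF, 0xBB, 0xBF, 0x41]); // UTF-8 BOM + "A"

console.log(sameBytes(ascii, roundTripUtf8(ascii)));     // true: ASCII survives
console.log(sameBytes(binary, roundTripUtf8(binary)));   // false: bytes were replaced
console.log(sameBytes(withBom, roundTripUtf8(withBom))); // false: the BOM was dropped
```

Only the pure-ASCII input comes back unchanged, which would be consistent with small text-only files appearing to work while files with embedded binary data break.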

Answer 2

As Bergi has already answered, you should use readAsArrayBuffer for binary data instead of readAsText, since the latter decodes the byte sequences, by default as UTF-8.

UTF-8 is a variable-length encoding, where a character can be between 1 and 4 bytes. Running the decoder on binary data that isn't UTF-8 will irrecoverably corrupt the binary data.

For example, only 0x00-0x7F is copied verbatim. 0xC2 to 0xDF is the start of a 2-byte sequence, 0xE0 to 0xEF of a 3-byte sequence and 0xF0 to 0xF4 of a 4-byte sequence. 0x80 to 0xBF is a continuation byte inside a sequence. 0xC0, 0xC1 and 0xF5 to 0xFF never appear in valid UTF-8.
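
As a rough sketch, the byte ranges can be written as a small classifier. The function name classifyUtf8Byte and its labels are made up for illustration; the ranges treat 0xC0, 0xC1 and 0xF5 to 0xFF as invalid, since those bytes cannot occur in well-formed UTF-8.

```javascript
// Hypothetical helper: classify a single byte by the role it can play in UTF-8.
function classifyUtf8Byte(b) {
  if (b <= 0x7F) return 'ascii';        // 0x00-0x7F: copied verbatim
  if (b <= 0xBF) return 'continuation'; // 0x80-0xBF: only valid inside a sequence
  if (b <= 0xC1) return 'invalid';      // 0xC0, 0xC1: would start an overlong form
  if (b <= 0xDF) return 'lead-2';       // 0xC2-0xDF: starts a 2-byte sequence
  if (b <= 0xEF) return 'lead-3';       // 0xE0-0xEF: starts a 3-byte sequence
  if (b <= 0xF4) return 'lead-4';       // 0xF0-0xF4: starts a 4-byte sequence
  return 'invalid';                     // 0xF5-0xFF: never valid
}

const labels = [0x25, 0x80, 0xC2, 0xE1, 0xF1, 0xFF].map(classifyUtf8Byte);
console.log(labels.join(', '));
// ascii, continuation, lead-2, lead-3, lead-4, invalid
```

Any byte above 0x7F therefore depends on its neighbours to decode cleanly, which arbitrary binary data such as compressed PDF streams will not respect.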

Here are a couple of examples of how it gets corrupted (node 12.1):

      ORIGINAL        =>  DECODED from UTF-8 to UCS-2  =>                 ENCODED from UCS-2 to UTF-8
----------------------------------------------------------------------------------------------------------------------
[0xC2,0x80,0x80,0x80] => [0x0080,0xFFFD,0xFFFD]        => [0xC2,0x80,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD]
[0xC3,0x80,0x80,0x80] => [0x00C0,0xFFFD,0xFFFD]        => [0xC3,0x80,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD]
[0xE0,0x80,0x80,0x80] => [0xFFFD,0xFFFD,0xFFFD,0xFFFD] => [0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD]
[0xE1,0x80,0x80,0x80] => [0x1000,0xFFFD]               => [0xE1,0x80,0x80,0xEF,0xBF,0xBD]
[0xF0,0x80,0x80,0x80] => [0xFFFD,0xFFFD,0xFFFD,0xFFFD] => [0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD]
[0xF1,0x80,0x80,0x80] => [0xD8C0,0xDC00]               => [0xF1,0x80,0x80,0x80]
[0xF0,0x80,0x00,0x00] => [0xFFFD,0xFFFD,0x0000,0x0000] => [0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0x00,0x00]
[0x80,0x80,0x80,0x80] => [0xFFFD,0xFFFD,0xFFFD,0xFFFD] => [0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD]
[0x81,0x82,0x83,0x84] => [0xFFFD,0xFFFD,0xFFFD,0xFFFD] => [0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD]

0xFFFD is the Replacement Character, which is used when the input can't be converted to a known code point.

Answer 3

It could be what I ran into long ago when working with graphic files. Binary files are in a specific format for a reason, and things like CR/LF might be legitimate in their own place. Reading a binary file as text and writing it back out could actually insert an extra CR/LF per line, throwing off the original format, content, and pointers in the file.

To confirm this, I would take your original file, read/write as array buffer to one Test file, then do the same thing with read/write as text to a SecondTest file. Then do a binary compare between the two files.

I would bet you are getting extra stuff in there unintentionally.
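
A minimal sketch of such a binary compare, assuming both files have already been loaded into Uint8Array values (for example via readAsArrayBuffer). The function name firstDifference is made up for illustration.

```javascript
// Hypothetical helper: index of the first differing byte, or -1 if identical.
function firstDifference(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return i;
  }
  // Same prefix: either identical, or one file is longer than the other.
  return a.length === b.length ? -1 : n;
}

// Sample data standing in for the two test files.
const original = Uint8Array.from([0x25, 0x50, 0x44, 0x46, 0xE2, 0xE3]);
const viaText = Uint8Array.from([0x25, 0x50, 0x44, 0x46, 0xEF, 0xBF, 0xBD]);

console.log(firstDifference(original, original)); // -1
console.log(firstDifference(original, viaText));  // 4
```

The reported index shows where the two outputs start to diverge, which makes it easier to inspect the surrounding bytes.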
