Corruption of base64-encoded ajax sends - some thoughts on this

I have been revisiting the question of why some kinds of character data are corrupted when sent via an ajax call to a web server, no matter what encoding is used. Even if the data is pre-encoded into a 7-bit format, what comes out is not always equal to what went in.

I was using a third-party JavaScript base64 encoder to prepare ajax data, and initially thought it had a bug. But other base64 encoders show exactly the same problem (including one which claims full Unicode compatibility), and there are several forum reports of similar problems, none of which seem to have been fully resolved. So I don't think the encoder itself is at fault.

I noticed that the corruption typically arises with data cut-and-pasted from other programs into CKEditor, if that data contains certain specific high-order ASCII/ANSI codes.

A few more tests seem to indicate that the problem has to do with some kind of discrepancy between the way javascript reads character data from a webpage, and the way it forms string data from internal programmatic methods, for example String.fromCharCode().

In the snippet below, the handling of the character ž (byte 0x9E in the Windows Western charset; U+017E, Latin small letter z with caron), inserted into an HTML document by cut-and-paste from a text editor, is compared with a character generated programmatically from the hex code 0x9E. This is one of several character codes which have been seen to give rise to this anomalous behaviour. Strangely, most other character codes above 127 give no such problems, and are rendered as two-byte Unicode as they should be.

<script>
  var pasted_char = 'ž';
  alert("Pasted Character: " + pasted_char + " Resultant Code(s): " + charcodes(pasted_char));

  var charcode = 0x9E;
  var generated_char = String.fromCharCode(charcode);
  alert("Generated Character: " + generated_char + " Resultant Code(s): " + charcodes(generated_char));

function charcodes(invar) {
  // Lists the UTF-16 code unit (as returned by charCodeAt) at each position in the string.
  var ccodes = "~";
  for (var ct = 0; ct < invar.length; ct++) {
    var invarc = invar.charCodeAt(ct);
    ccodes += invarc + "~";
  }
  return ccodes;
}
</script>

With a UTF-8 page charset, gives:

Pasted Character: [0xFFFD] Resultant Code(s): ~65533~

Generated Character: [blank] Resultant Code(s): ~158~

With a default page charset, gives:

Pasted Character: ž Resultant Code(s): ~382~

Generated Character: [blank] Resultant Code(s): ~158~

Notably, neither handling of the pasted character is correct, and there is no such ANSI code as 382!

Both outputs are single byte.

Strictly speaking this character is 8-bit ASCII/ANSI, which js does not claim to handle, however it is perfectly legitimate for it to be pasted into an HTML editor, for example from a text document. Thus the javascript subsystem should be capable of handling such input without bugs arising. It certainly seems to me, anyway, that generating the same character string in two different ways should not return two different results.

Any thoughts on this would be welcome. I am not sure exactly what role this anomaly plays in corrupting the ajax send, but it seems likely it is the culprit.

All strings in JavaScript are UTF-16 (or occasionally its precursor UCS-2), regardless of the character encoding of the page. This is stated in section 8.4 of the ES5 specification, and section 8.5 in ES3. For common characters such as a-z, it makes little difference whether you want ANSI or UTF-8 codes, because they are the same, but this is not true for all characters. That is why the pasted ž reports 382 (0x17E, its Unicode code point), while String.fromCharCode(0x9E) produces U+009E, which is a control character and not ž.
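To make the point concrete, here is a minimal sketch. The names zCaron, fromByte and codeUnits are mine, not from the question, and the character is written as a \u escape so the encoding of the file cannot change it. It shows that charCodeAt reports UTF-16 code units, so the Windows byte value 0x9E and the character ž are simply two different strings.

```javascript
// "ž" written as an escape, so the page or file charset cannot alter it.
var zCaron = '\u017E';
// Code unit 0x9E is U+009E, a C1 control character, not "ž".
var fromByte = String.fromCharCode(0x9E);

// Returns the UTF-16 code unit at each position of the string.
function codeUnits(str) {
  var out = [];
  for (var i = 0; i < str.length; i++) {
    out.push(str.charCodeAt(i));
  }
  return out;
}

console.log(codeUnits(zCaron).join(','));   // 382 (0x017E)
console.log(codeUnits(fromByte).join(',')); // 158 (0x009E)
console.log(zCaron === fromByte);           // false
```
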

If you want to generate ANSI, you will need a 256-item dictionary or some other logic for the character mappings.
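As one possible form of "some other logic", here is a sketch that assumes "ANSI" means Windows-1252. Only the bytes 0x80 to 0x9F differ from the Unicode code points, so a 32-entry array is enough. CP1252_HIGH, toCp1252Byte and encodeCp1252 are hypothetical names used for illustration, and the sketch throws on anything that has no Windows-1252 byte rather than guessing a replacement.

```javascript
// Unicode code points for Windows-1252 bytes 0x80..0x9F; 0 marks an unassigned byte.
var CP1252_HIGH = [
  0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
  0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
];

// Hypothetical helper: Windows-1252 byte for one UTF-16 code unit, or -1 if there is none.
function toCp1252Byte(codeUnit) {
  if (codeUnit < 0x80 || (codeUnit >= 0xA0 && codeUnit <= 0xFF)) {
    return codeUnit; // same value in Unicode and Windows-1252
  }
  var idx = CP1252_HIGH.indexOf(codeUnit);
  return idx === -1 ? -1 : 0x80 + idx;
}

// Hypothetical helper: array of Windows-1252 byte values for a whole string.
function encodeCp1252(str) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    var b = toCp1252Byte(unit);
    if (b === -1) {
      throw new Error('No Windows-1252 byte for U+' + unit.toString(16).toUpperCase());
    }
    bytes.push(b);
  }
  return bytes;
}

console.log(encodeCp1252('\u017E').join(','));          // 158
console.log(encodeCp1252('caf\u00E9 \u20AC').join(',')); // 99,97,102,233,32,128
```

Note that the control character U+009E itself has no mapping here, which matches the behaviour seen in the question: it is not the same thing as the byte 0x9E.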


Here is such a table (without control chars):

var ANSI = {
    // Codes 129, 141, 143, 144 and 157 are unassigned in Windows-1252 and are omitted.
    " ": 32,
    "!": 33,
    "\"": 34,
    "#": 35,
    "$": 36,
    "%": 37,
    "&": 38,
    "'": 39,
    "(": 40,
    ")": 41,
    "*": 42,
    "+": 43,
    ",": 44,
    "-": 45,
    ".": 46,
    "/": 47,
    "0": 48,
    "1": 49,
    "2": 50,
    "3": 51,
    "4": 52,
    "5": 53,
    "6": 54,
    "7": 55,
    "8": 56,
    "9": 57,
    ":": 58,
    ";": 59,
    "<": 60,
    "=": 61,
    ">": 62,
    "?": 63,
    "@": 64,
    "A": 65,
    "B": 66,
    "C": 67,
    "D": 68,
    "E": 69,
    "F": 70,
    "G": 71,
    "H": 72,
    "I": 73,
    "J": 74,
    "K": 75,
    "L": 76,
    "M": 77,
    "N": 78,
    "O": 79,
    "P": 80,
    "Q": 81,
    "R": 82,
    "S": 83,
    "T": 84,
    "U": 85,
    "V": 86,
    "W": 87,
    "X": 88,
    "Y": 89,
    "Z": 90,
    "[": 91,
    "\\": 92,
    "]": 93,
    "^": 94,
    "_": 95,
    "`": 96,
    "a": 97,
    "b": 98,
    "c": 99,
    "d": 100,
    "e": 101,
    "f": 102,
    "g": 103,
    "h": 104,
    "i": 105,
    "j": 106,
    "k": 107,
    "l": 108,
    "m": 109,
    "n": 110,
    "o": 111,
    "p": 112,
    "q": 113,
    "r": 114,
    "s": 115,
    "t": 116,
    "u": 117,
    "v": 118,
    "w": 119,
    "x": 120,
    "y": 121,
    "z": 122,
    "{": 123,
    "|": 124,
    "}": 125,
    "~": 126,
    "\u007F": 127, // DEL, escaped so it cannot collide with the space key
    "€": 128,
    "‚": 130,
    "ƒ": 131,
    "„": 132,
    "…": 133,
    "†": 134,
    "‡": 135,
    "ˆ": 136,
    "‰": 137,
    "Š": 138,
    "‹": 139,
    "Œ": 140,
    "Ž": 142,
    "‘": 145,
    "’": 146,
    "“": 147,
    "”": 148,
    "•": 149,
    "–": 150,
    "—": 151,
    "˜": 152,
    "™": 153,
    "š": 154,
    "›": 155,
    "œ": 156,
    "ž": 158,
    "Ÿ": 159,
    "\u00A0": 160, // no-break space
    "¡": 161,
    "¢": 162,
    "£": 163,
    "¤": 164,
    "¥": 165,
    "¦": 166,
    "§": 167,
    "¨": 168,
    "©": 169,
    "ª": 170,
    "«": 171,
    "¬": 172,
    "\u00AD": 173, // soft hyphen
    "®": 174,
    "¯": 175,
    "°": 176,
    "±": 177,
    "²": 178,
    "³": 179,
    "´": 180,
    "µ": 181,
    "¶": 182,
    "·": 183,
    "¸": 184,
    "¹": 185,
    "º": 186,
    "»": 187,
    "¼": 188,
    "½": 189,
    "¾": 190,
    "¿": 191,
    "À": 192,
    "Á": 193,
    "Â": 194,
    "Ã": 195,
    "Ä": 196,
    "Å": 197,
    "Æ": 198,
    "Ç": 199,
    "È": 200,
    "É": 201,
    "Ê": 202,
    "Ë": 203,
    "Ì": 204,
    "Í": 205,
    "Î": 206,
    "Ï": 207,
    "Ð": 208,
    "Ñ": 209,
    "Ò": 210,
    "Ó": 211,
    "Ô": 212,
    "Õ": 213,
    "Ö": 214,
    "×": 215,
    "Ø": 216,
    "Ù": 217,
    "Ú": 218,
    "Û": 219,
    "Ü": 220,
    "Ý": 221,
    "Þ": 222,
    "ß": 223,
    "à": 224,
    "á": 225,
    "â": 226,
    "ã": 227,
    "ä": 228,
    "å": 229,
    "æ": 230,
    "ç": 231,
    "è": 232,
    "é": 233,
    "ê": 234,
    "ë": 235,
    "ì": 236,
    "í": 237,
    "î": 238,
    "ï": 239,
    "ð": 240,
    "ñ": 241,
    "ò": 242,
    "ó": 243,
    "ô": 244,
    "õ": 245,
    "ö": 246,
    "÷": 247,
    "ø": 248,
    "ù": 249,
    "ú": 250,
    "û": 251,
    "ü": 252,
    "ý": 253,
    "þ": 254,
    "ÿ": 255
};

I generated this with the following code applied to this page, then copied the result here with some very minor modifications (escaping \ and "). Characters that do not survive copy-and-paste, notably the different kinds of space, need care: duplicate keys in an object literal silently overwrite each other, so check the space-like entries before you use the table. You might also want to switch to the encoding-safe \uXXXX format for the keys.

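// All cells of the first table on the page; the loop below assumes 7 cells per row.
var cells = document.getElementsByTagName('table')[0].getElementsByTagName('td'),
    a = [], i, j, k, v;
// Walk the table column by column, skipping the first seven cells.
for (j = 0; j < 7; ++j) for (i = 7 + j; i < cells.length; i += 7) {
    k = cells[i].textContent.slice(-1);                          // last character of the cell: the glyph
    v = +cells[i].textContent.slice(0, 3).replace(/[^\d]/g, ''); // digits in the first three characters: the code
    a.push('    "' + k + '": ' + v);
}
// Evaluates to the source text of the object literal (for example, in the browser console).
'{\n' + a.join(',\n') + '\n}';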
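Coming back to the original base64 problem: if the server can accept UTF-8 instead of ANSI, one option is to skip the lookup table and base64-encode the UTF-8 bytes of the string. This is only a sketch under that assumption, not the asker's actual setup. utf8ToBase64 and base64ToUtf8 are hypothetical names, TextEncoder, TextDecoder, btoa and atob are the standard browser APIs, and the Buffer branch is just a fallback for running it under Node.js.

```javascript
// Hypothetical helper: base64 of the UTF-8 bytes of a JavaScript (UTF-16) string.
function utf8ToBase64(str) {
  var bytes = new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
  var binary = '';
  for (var i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]); // one code unit in 0..255 per byte
  }
  return typeof btoa === 'function'
    ? btoa(binary)
    : Buffer.from(binary, 'binary').toString('base64');
}

// Hypothetical helper: the inverse, decoding base64 back to a JavaScript string.
function base64ToUtf8(b64) {
  var binary = typeof atob === 'function'
    ? atob(b64)
    : Buffer.from(b64, 'base64').toString('binary');
  var bytes = new Uint8Array(binary.length);
  for (var i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new TextDecoder().decode(bytes); // decodes as UTF-8 by default
}

var encoded = utf8ToBase64('\u017E');
console.log(encoded);               // xb4= (the bytes 0xC5 0xBE)
console.log(base64ToUtf8(encoded)); // the single character U+017E
```

The server then has to base64-decode and interpret the bytes as UTF-8. If it decodes them as Windows-1252 instead, the same kind of corruption comes back.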
