
Corruption of base64-encoded ajax sends - some thoughts on this

I have been revisiting the question of why some kinds of character data are corrupted when sent via an ajax call to a webserver, no matter what encoding is used. Even if the data is pre-encoded into a 7-bit format, what comes out is still not always equal to what went in.

I was using a third-party javascript base64 encoder to prepare ajax data, and initially thought it had a bug. But other base64 encoders show exactly the same problem, including one which claims full Unicode compatibility, and there are several forum reports of similar problems, none of which seem to have been fully resolved. So I don't think the encoder itself is at fault.

I noticed that the corruption typically arises with data cut-and-pasted from other programs into CKEditor, if that data contains certain specific high-order ASCII/ANSI codes.

A few more tests seem to indicate that the problem has to do with some kind of discrepancy between the way javascript reads character data from a webpage and the way it forms string data from internal programmatic methods, for example String.fromCharCode().

In the snippet below, the handling of character 0x9E inserted into an HTML document by cut-and-paste from a text editor is compared with the same character generated programmatically from hex code 0x9E (U+017E, Latin small letter z with caron, in Arial with the Windows Western charset). This is one of several character codes which have been seen to give rise to this anomalous behaviour. Strangely, most other character codes above 127 give no such problems, and are rendered as two-byte unicode as they should be.

<script>
  var pasted_char = 'ž';
  alert("Pasted Character: " + pasted_char + " Resultant Code(s): " + charcodes(pasted_char));

  var charcode = 0x9E;
  var generated_char = String.fromCharCode(charcode);
  alert("Generated Character: " + generated_char + " Resultant Code(s): " + charcodes(generated_char));

function charcodes(invar) {
  // Lists the UTF-16 code unit of each character in the string.
  var ccodes = "~";
  for (var ct = 0; ct < invar.length; ct++) {
    var invarc = invar.charCodeAt(ct);
    ccodes += invarc + "~";
  }
  return ccodes;
}
</script>

With a UTF-8 page charset, this gives:

Pasted Character: [0xFFFD] Resultant Code(s): ~65533~

Generated Character: [blank] Resultant Code(s): ~158~

With a default page charset, this gives:

Pasted Character: ž Resultant Code(s): ~382~

Generated Character: [blank] Resultant Code(s): ~158~

Notably, neither handling of the pasted character is correct, and there is no such ANSI code as 382!

Both outputs are single byte.

Strictly speaking this character is 8-bit ASCII/ANSI, which js does not claim to handle; however, it is perfectly legitimate for it to be pasted into an HTML editor, for example from a text document. Thus the javascript subsystem should be capable of handling such input without bugs arising. It certainly seems to me, anyway, that generating the same character string in two different ways should not return two different results.

Any thoughts on this would be welcome. I am not sure exactly what role this anomaly plays in corrupting the ajax send, but it seems likely it is the culprit.
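One way to check whether the base64 step or the character data is at fault is to round-trip a string through a base64 encoder that works on UTF-8 bytes. The following is only a minimal sketch, assuming a runtime that provides TextEncoder, TextDecoder, btoa and atob (current browsers and recent Node.js); the helper names toBase64Utf8 and fromBase64Utf8 are hypothetical and not part of any library mentioned above.

```javascript
// Hypothetical helper: base64-encode a string via its UTF-8 bytes.
function toBase64Utf8(str) {
  var bytes = new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
  var binary = "";
  for (var i = 0; i < bytes.length; i++) {
    // One character per byte, so every char code is <= 0xFF as btoa requires.
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Hypothetical helper: reverse of toBase64Utf8.
function fromBase64Utf8(b64) {
  var binary = atob(b64);
  var bytes = new Uint8Array(binary.length);
  for (var i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new TextDecoder("utf-8").decode(bytes);
}

var sample = "\u017E"; // the "z with caron" character from the snippet
var encoded = toBase64Utf8(sample);
console.log(encoded);                            // base64 of the bytes C5 BE
console.log(fromBase64Utf8(encoded) === sample); // true if the round trip is lossless
```

If the round trip is lossless in the browser but the server still sees different data, the corruption is more likely to come from how the page or the request is decoded than from the base64 step.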

All strings in JavaScript are in UTF-16 (and occasionally its precursor UCS-2), regardless of the character encoding of the page. This is stated in section 8.4 of the ES5 specification, and section 8.5 in ES3. For common characters such as a-z, this makes little difference whether you want ANSI or UTF-8 codes, because the values are the same, but this is not true for all characters.
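The numbers reported in the question are consistent with this. As a small illustration (the variable names are made up for the example, and TextDecoder is assumed to be available): "ž" is the code point U+017E, which is 382 in decimal, while String.fromCharCode(0x9E) produces U+009E, an invisible control character. The byte 0x9E on its own is not valid UTF-8, so a UTF-8 decoder replaces it with U+FFFD (65533).

```javascript
// "ž" as a JavaScript string: one UTF-16 code unit, U+017E.
var pasted = "\u017E";

// 0x9E is the Windows-1252 *byte* for "ž", but as a code unit it is U+009E.
var generated = String.fromCharCode(0x9E);

console.log(pasted.charCodeAt(0));    // 382
console.log(generated.charCodeAt(0)); // 158
console.log(pasted === generated);    // false: two different characters

// A lone 0x9E byte decoded as UTF-8 becomes the replacement character.
var decoded = new TextDecoder("utf-8").decode(new Uint8Array([0x9E]));
console.log(decoded.charCodeAt(0));   // 65533
```

So the two ways of "generating the same character" in the question do not actually produce the same character: one starts from a Unicode code point, the other from a Windows-1252 byte value.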

If you want to generate ANSI, you will need a 256-item dictionary or some other logic for the character mappings.

Here is such a table (without control chars):

var ANSI = {
    " ": 32,
    "!": 33,
    "\"": 34,
    "#": 35,
    "$": 36,
    "%": 37,
    "&": 38,
    "'": 39,
    "(": 40,
    ")": 41,
    "*": 42,
    "+": 43,
    ",": 44,
    "-": 45,
    ".": 46,
    "/": 47,
    "0": 48,
    "1": 49,
    "2": 50,
    "3": 51,
    "4": 52,
    "5": 53,
    "6": 54,
    "7": 55,
    "8": 56,
    "9": 57,
    ":": 58,
    ";": 59,
    "<": 60,
    "=": 61,
    ">": 62,
    "?": 63,
    "@": 64,
    "A": 65,
    "B": 66,
    "C": 67,
    "D": 68,
    "E": 69,
    "F": 70,
    "G": 71,
    "H": 72,
    "I": 73,
    "J": 74,
    "K": 75,
    "L": 76,
    "M": 77,
    "N": 78,
    "O": 79,
    "P": 80,
    "Q": 81,
    "R": 82,
    "S": 83,
    "T": 84,
    "U": 85,
    "V": 86,
    "W": 87,
    "X": 88,
    "Y": 89,
    "Z": 90,
    "[": 91,
    "\\": 92,
    "]": 93,
    "^": 94,
    "_": 95,
    "`": 96,
    "a": 97,
    "b": 98,
    "c": 99,
    "d": 100,
    "e": 101,
    "f": 102,
    "g": 103,
    "h": 104,
    "i": 105,
    "j": 106,
    "k": 107,
    "l": 108,
    "m": 109,
    "n": 110,
    "o": 111,
    "p": 112,
    "q": 113,
    "r": 114,
    "s": 115,
    "t": 116,
    "u": 117,
    "v": 118,
    "w": 119,
    "x": 120,
    "y": 121,
    "z": 122,
    "{": 123,
    "|": 124,
    "}": 125,
    "~": 126,
    "\u007F": 127, // DEL, written as an escape so it does not collide with " "
    "€": 128,
    "\u0081": 129, // unassigned in Windows-1252; key is the same-numbered code point
    "‚": 130,
    "ƒ": 131,
    "„": 132,
    "…": 133,
    "†": 134,
    "‡": 135,
    "ˆ": 136,
    "‰": 137,
    "Š": 138,
    "‹": 139,
    "Œ": 140,
    "\u008D": 141, // unassigned in Windows-1252
    "Ž": 142,
    "\u008F": 143, // unassigned in Windows-1252 (was a duplicate "«" key)
    "\u0090": 144, // unassigned in Windows-1252
    "‘": 145,
    "’": 146,
    "“": 147,
    "”": 148,
    "•": 149,
    "–": 150,
    "—": 151,
    "˜": 152,
    "™": 153,
    "š": 154,
    "›": 155,
    "œ": 156,
    "\u009D": 157, // unassigned in Windows-1252
    "ž": 158,
    "Ÿ": 159,
    "\u00A0": 160, // no-break space
    "¡": 161,
    "¢": 162,
    "£": 163,
    "¤": 164,
    "¥": 165,
    "¦": 166,
    "§": 167,
    "¨": 168,
    "©": 169,
    "ª": 170,
    "«": 171,
    "¬": 172,
    "\u00AD": 173, // soft hyphen
    "®": 174,
    "¯": 175,
    "°": 176,
    "±": 177,
    "²": 178,
    "³": 179,
    "´": 180,
    "µ": 181,
    "¶": 182,
    "·": 183,
    "¸": 184,
    "¹": 185,
    "º": 186,
    "»": 187,
    "¼": 188,
    "½": 189,
    "¾": 190,
    "¿": 191,
    "À": 192,
    "Á": 193,
    "Â": 194,
    "Ã": 195,
    "Ä": 196,
    "Å": 197,
    "Æ": 198,
    "Ç": 199,
    "È": 200,
    "É": 201,
    "Ê": 202,
    "Ë": 203,
    "Ì": 204,
    "Í": 205,
    "Î": 206,
    "Ï": 207,
    "Ð": 208,
    "Ñ": 209,
    "Ò": 210,
    "Ó": 211,
    "Ô": 212,
    "Õ": 213,
    "Ö": 214,
    "×": 215,
    "Ø": 216,
    "Ù": 217,
    "Ú": 218,
    "Û": 219,
    "Ü": 220,
    "Ý": 221,
    "Þ": 222,
    "ß": 223,
    "à": 224,
    "á": 225,
    "â": 226,
    "ã": 227,
    "ä": 228,
    "å": 229,
    "æ": 230,
    "ç": 231,
    "è": 232,
    "é": 233,
    "ê": 234,
    "ë": 235,
    "ì": 236,
    "í": 237,
    "î": 238,
    "ï": 239,
    "ð": 240,
    "ñ": 241,
    "ò": 242,
    "ó": 243,
    "ô": 244,
    "õ": 245,
    "ö": 246,
    "÷": 247,
    "ø": 248,
    "ù": 249,
    "ú": 250,
    "û": 251,
    "ü": 252,
    "ý": 253,
    "þ": 254,
    "ÿ": 255
};

I generated this with the following code applied to this page, then copied and pasted the result here with some very minor modifications (escapes on \ and "). Some characters may not survive the copy properly (notably the different types of space and the unassigned code points), so check those entries and remove or modify them before you use the table. You might also want to switch to the character-encoding-safe \uXXXX format for the keys.

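// Run in the browser console on a page whose first <table> lists the codes.
// The constant 7 reflects that table's layout: 7 cells per row, first row skipped.
var cells = document.getElementsByTagName('table')[0].getElementsByTagName('td'),
    a = [], i, j, k, v;
// Walk the table column by column so the entries come out in numeric order.
for (j = 0; j < 7; ++j) for (i = 7 + j; i < cells.length; i += 7) {
    k = cells[i].textContent.slice(-1);                           // last char: the character itself
    v = +cells[i].textContent.slice(0, 3).replace(/[^\d]/g, ''); // leading digits: its code
    a.push('    "' + k + '": ' + v);
}
// The value of this last expression is the object literal text.
'{\n' + a.join(',\n') + '\n}';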
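As an example of the "some other logic" option, the full 256-entry dictionary can be avoided, because Windows-1252 only differs from the first 256 Unicode code points in the range 0x80 to 0x9F. The following is a sketch under that assumption, not a definitive implementation; the names CP1252_HIGH and toWindows1252 are hypothetical, and replacing unmappable characters with "?" is an arbitrary choice made for the example.

```javascript
// Characters whose Windows-1252 byte (0x80-0x9F) differs from their code point.
// Keys use \uXXXX escapes so the source file encoding does not matter.
var CP1252_HIGH = {
  "\u20AC": 0x80, "\u201A": 0x82, "\u0192": 0x83, "\u201E": 0x84,
  "\u2026": 0x85, "\u2020": 0x86, "\u2021": 0x87, "\u02C6": 0x88,
  "\u2030": 0x89, "\u0160": 0x8A, "\u2039": 0x8B, "\u0152": 0x8C,
  "\u017D": 0x8E, "\u2018": 0x91, "\u2019": 0x92, "\u201C": 0x93,
  "\u201D": 0x94, "\u2022": 0x95, "\u2013": 0x96, "\u2014": 0x97,
  "\u02DC": 0x98, "\u2122": 0x99, "\u0161": 0x9A, "\u203A": 0x9B,
  "\u0153": 0x9C, "\u017E": 0x9E, "\u0178": 0x9F
};

// Hypothetical helper: returns an array of Windows-1252 byte values.
function toWindows1252(str) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    var ch = str.charAt(i);
    var code = str.charCodeAt(i);
    if (Object.prototype.hasOwnProperty.call(CP1252_HIGH, ch)) {
      bytes.push(CP1252_HIGH[ch]);        // remapped 0x80-0x9F range
    } else if (code < 0x80 || (code >= 0xA0 && code <= 0xFF)) {
      bytes.push(code);                   // ASCII and 0xA0-0xFF map one-to-one
    } else {
      bytes.push(0x3F);                   // "?" for anything not representable
    }
  }
  return bytes;
}

console.log(toWindows1252("\u017E")); // [ 158 ]
```

With this, the character from the question maps to byte 158 (0x9E), which is the value the asker expected to see.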
