
Java String.getBytes("UTF8") JavaScript analog

Bytes to string and backward

The functions written there work properly, that is, pack(unpack("string")) yields "string". But I would like to get the same result that "string".getBytes("UTF8") gives in Java.

The question is: how can I write a function in JavaScript that provides the same functionality as Java's getBytes("UTF8")?

For Latin strings, unpack(str) from the article mentioned above gives the same result as getBytes("UTF8"), except that it adds a 0 at the odd positions. But with non-Latin strings it seems to work completely differently. Is there a way to work with string data in JavaScript the way Java does?

You don't need to write a full-on UTF-8 encoder; there is a much easier JS idiom to convert a Unicode string into a string of bytes representing UTF-8 code units:

unescape(encodeURIComponent(str))

(This works because the odd encoding used by escape/unescape uses %xx hex sequences to represent the ISO-8859-1 characters with that code, instead of UTF-8 as used by URI-component escaping. Similarly, decodeURIComponent(escape(bytes)) goes in the other direction.)

So if you want an Array out, it would be:

function toUTF8Array(str) {
    var utf8= unescape(encodeURIComponent(str));
    var arr= new Array(utf8.length);
    for (var i= 0; i<utf8.length; i++)
        arr[i]= utf8.charCodeAt(i);
    return arr;
}
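
For the opposite direction, the decodeURIComponent(escape(bytes)) idiom mentioned above can be wrapped the same way. This is only a minimal sketch: fromUTF8Array is a hypothetical name that does not appear in the original answer, and escape/unescape are deprecated legacy functions, although browsers and Node.js still provide them.

```javascript
// Encode: string -> array of UTF-8 byte values (same idiom as above).
function toUTF8Array(str) {
    var utf8 = unescape(encodeURIComponent(str));
    var arr = new Array(utf8.length);
    for (var i = 0; i < utf8.length; i++)
        arr[i] = utf8.charCodeAt(i);
    return arr;
}

// Decode (hypothetical helper): array of UTF-8 byte values -> string.
function fromUTF8Array(arr) {
    var bytes = '';
    for (var i = 0; i < arr.length; i++)
        // "& 0xff" lets the helper also accept Java-style negative byte values.
        bytes += String.fromCharCode(arr[i] & 0xff);
    // escape() turns each byte into %xx; decodeURIComponent() reads that as UTF-8.
    return decodeURIComponent(escape(bytes));
}
```

Note that decodeURIComponent throws a URIError if the bytes are not valid UTF-8, so callers that handle untrusted input would probably want a try/catch around fromUTF8Array.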

You can use this function (gist):

function toUTF8Array(str) {
    var utf8 = [];
    for (var i=0; i < str.length; i++) {
        var charcode = str.charCodeAt(i);
        if (charcode < 0x80) utf8.push(charcode);
        else if (charcode < 0x800) {
            utf8.push(0xc0 | (charcode >> 6), 
                      0x80 | (charcode & 0x3f));
        }
        else if (charcode < 0xd800 || charcode >= 0xe000) {
            utf8.push(0xe0 | (charcode >> 12), 
                      0x80 | ((charcode>>6) & 0x3f), 
                      0x80 | (charcode & 0x3f));
        }
        else {
            // let's keep things simple and only handle chars up to U+FFFF...
            utf8.push(0xef, 0xbf, 0xbd); // U+FFFD "replacement character"
        }
    }
    return utf8;
}

Example of use:

>>> toUTF8Array("中€")
[228, 184, 173, 226, 130, 172]

If you want negative numbers for values over 127, as Java's byte-to-int conversion produces, you have to tweak the constants and use

            utf8.push(0xffffffc0 | (charcode >> 6), 
                      0xffffff80 | (charcode & 0x3f));

and

            utf8.push(0xffffffe0 | (charcode >> 12), 
                      0xffffff80 | ((charcode>>6) & 0x3f), 
                      0xffffff80 | (charcode & 0x3f));
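
An alternative to editing the constants is to convert the finished array in one pass. The sketch below assumes you already have an array of unsigned byte values (for instance the output of toUTF8Array above); toSignedBytes is a hypothetical helper name, not part of the original answer.

```javascript
// Reinterpret each unsigned byte value (0..255) as a signed 8-bit value
// (-128..127), which is how Java prints the elements of a byte[].
function toSignedBytes(bytes) {
    var out = new Array(bytes.length);
    for (var i = 0; i < bytes.length; i++)
        // Shift the byte into the top of a 32-bit int, then shift back with
        // sign extension so that bit 7 becomes the sign.
        out[i] = (bytes[i] << 24) >> 24;
    return out;
}

console.log(toSignedBytes([228, 184, 173, 226, 130, 172]));
// [-28, -72, -83, -30, -126, -84]
```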

TextEncoder is part of the Encoding Living Standard and, according to the Encoding API entry on the Chromium Dashboard, it has shipped in Firefox and will ship in Chrome 38. There is also a text-encoding polyfill available for other browsers.

The JavaScript code sample below returns a Uint8Array filled with the values you expect.

(new TextEncoder()).encode("string") 
// [115, 116, 114, 105, 110, 103]

A more interesting example, which better shows UTF-8 at work, replaces the "in" in "string" with "îñ":

(new TextEncoder()).encode("strîñg")
// [115, 116, 114, 195, 174, 195, 177, 103]
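
If a plain Array is needed rather than a Uint8Array, or if the bytes have to be turned back into a string, something along these lines should work in environments that provide both TextEncoder and TextDecoder. The variable names are illustrative only.

```javascript
var encoder = new TextEncoder();    // TextEncoder always produces UTF-8
var decoder = new TextDecoder();    // TextDecoder defaults to UTF-8

var bytes = encoder.encode("中€");  // Uint8Array
var plain = Array.from(bytes);      // plain Array of numbers
var text = decoder.decode(bytes);   // back to the original string

console.log(plain); // [228, 184, 173, 226, 130, 172]
console.log(text);  // "中€"
```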

The following functions will deal with characters above U+FFFF.

Because JavaScript strings are UTF-16, two "characters" (code units) are used in a string to represent a character above the BMP, and charCodeAt returns the corresponding surrogate code. The fixedCharCodeAt function handles this.

function encodeTextToUtf8(text) {
    var bin = [];
    for (var i = 0; i < text.length; i++) {
        var v = fixedCharCodeAt(text, i);
        if (v === false) continue;
        encodeCharCodeToUtf8(v, bin);
    }
    return bin;
}

function encodeCharCodeToUtf8(codePt, bin) {
    if (codePt <= 0x7F) {
        bin.push(codePt);
    } else if (codePt <= 0x7FF) {
        bin.push(192 | (codePt >> 6), 128 | (codePt & 63));
    } else if (codePt <= 0xFFFF) {
        bin.push(224 | (codePt >> 12),
            128 | ((codePt >> 6) & 63),
            128 | (codePt & 63));
    } else if (codePt <= 0x1FFFFF) {
        bin.push(240 | (codePt >> 18),
            128 | ((codePt >> 12) & 63), 
            128 | ((codePt >> 6) & 63),
            128 | (codePt & 63));
    }
}

function fixedCharCodeAt(str, idx) {
    // ex. fixedCharCodeAt('\uD800\uDC00', 0); // 65536
    // ex. fixedCharCodeAt('\uD800\uDC00', 1); // false
    idx = idx || 0;
    var code = str.charCodeAt(idx);
    var hi, low;
    if (0xD800 <= code && code <= 0xDBFF) { // High surrogate (could change last hex to 0xDB7F to treat high private surrogates as single characters)
        hi = code;
        low = str.charCodeAt(idx + 1);
        // A high surrogate must be followed by a low surrogate (0xDC00..0xDFFF)
        if (isNaN(low) || low < 0xDC00 || low > 0xDFFF) {
            throw new Error('High surrogate not followed by low surrogate at position ' + idx);
        }
        return ((hi - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
    }
    if (0xDC00 <= code && code <= 0xDFFF) { // Low surrogate
        // We return false to allow loops to skip this iteration, since the high surrogate should already have been handled in the previous iteration
        return false;
        /*hi = str.charCodeAt(idx-1);
          low = code;
          return ((hi - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;*/
    }
    return code;
}
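
In ES2015 and later, the surrogate arithmetic can be left to the language: iterating a string with for...of walks it by code point, and codePointAt returns the full code point. The following is a sketch under that assumption. encodeUtf8ByCodePoint is a hypothetical name, and replacing lone surrogates with U+FFFD (rather than throwing or skipping them as the code above does) is a choice made here to mirror what TextEncoder produces.

```javascript
// Sketch: UTF-8 encoder that relies on code point iteration (ES2015+).
function encodeUtf8ByCodePoint(text) {
    var bin = [];
    for (var ch of text) {                 // one iteration per code point
        var codePt = ch.codePointAt(0);
        if (codePt >= 0xD800 && codePt <= 0xDFFF) {
            codePt = 0xFFFD;               // lone surrogate: substitute U+FFFD
        }
        if (codePt <= 0x7F) {              // 1 byte
            bin.push(codePt);
        } else if (codePt <= 0x7FF) {      // 2 bytes
            bin.push(0xC0 | (codePt >> 6),
                     0x80 | (codePt & 0x3F));
        } else if (codePt <= 0xFFFF) {     // 3 bytes
            bin.push(0xE0 | (codePt >> 12),
                     0x80 | ((codePt >> 6) & 0x3F),
                     0x80 | (codePt & 0x3F));
        } else {                           // 4 bytes, up to U+10FFFF
            bin.push(0xF0 | (codePt >> 18),
                     0x80 | ((codePt >> 12) & 0x3F),
                     0x80 | ((codePt >> 6) & 0x3F),
                     0x80 | (codePt & 0x3F));
        }
    }
    return bin;
}

console.log(encodeUtf8ByCodePoint("\uD83D\uDE00")); // U+1F600
// [240, 159, 152, 128]
```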
