How can I tell if a string contains multibyte characters in Javascript?

Is it possible in Javascript to detect if a string contains multibyte characters? If so, is it possible to tell which ones?

The problem I'm running into is this (apologies if the Unicode char doesn't show up right for you):

s = "𝌆";

alert(s.length);    // '2'
alert(s.charAt(0)); // '��'
alert(s.charAt(1)); // '��'

Edit for a bit of clarity here (I hope). As I understand it now, all strings in Javascript are represented as a series of UTF-16 code units, which means that regular characters actually take up 2 bytes (16 bits), so my usage of "multibyte" in the title was a bit off. Some characters do not fall in the Basic Multilingual Plane (BMP), such as the character in the example above, and so they take up two code units (32 bits). That is the question I was asking. I'm also not editing the original title, since to someone who doesn't know much about this stuff (and hence would be searching SO for info about it), "multibyte" would make sense.
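As a small illustration of the code unit versus code point distinction, the sketch below inspects the character from the question. It is written with `\u` escapes so it does not depend on how the page renders the glyph; the variable name is only for illustration.

```javascript
// "\uD834\uDF06" is U+1D306, the character used in the question.
const s = "\uD834\uDF06";

// length and charCodeAt count UTF-16 code units.
console.log(s.length);                       // 2
console.log(s.charCodeAt(0).toString(16));   // "d834" (high surrogate)
console.log(s.charCodeAt(1).toString(16));   // "df06" (low surrogate)

// codePointAt and string iteration work on whole code points.
console.log(s.codePointAt(0).toString(16));  // "1d306"
console.log([...s].length);                  // 1
```

This is why `charAt(0)` and `charAt(1)` each return half of the pair, which cannot be displayed on its own.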

JavaScript strings are UCS-2 encoded but can represent Unicode code points outside the Basic Multilingual Plane (U+0000 - U+D7FF and U+E000 - U+FFFF) using two 16 bit numbers (a UTF-16 surrogate pair), the first of which must be in the range U+D800 - U+DBFF and the second in the range U+DC00 - U+DFFF.

Based on this, it's easy to detect whether a string contains any characters that lie outside the Basic Multilingual Plane (which is what I think you're asking: you want to be able to identify whether a string contains any characters that lie outside the range of code points that JavaScript represents as a single character):

function containsSurrogatePair(str) {
    return /[\uD800-\uDFFF]/.test(str);
}

alert( containsSurrogatePair("foo") ); // false
alert( containsSurrogatePair("f𝌆") ); // true
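The regular expression above matches any single surrogate code unit, so it also returns true for a malformed string that contains only half of a pair. If that distinction matters, a stricter check might look like the following sketch. Both function names are hypothetical and not part of the original answer.

```javascript
// Hypothetical helper: true only if a high surrogate is immediately
// followed by a low surrogate (a well-formed pair).
function containsWellFormedPair(str) {
    return /[\uD800-\uDBFF][\uDC00-\uDFFF]/.test(str);
}

// Hypothetical helper: true if a surrogate appears without its partner,
// i.e. a high surrogate not followed by a low one, or a low surrogate
// not preceded by a high one.
function containsLoneSurrogate(str) {
    return /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/.test(str);
}

console.log(containsWellFormedPair("foo"));            // false
console.log(containsWellFormedPair("f\uD834\uDF06"));  // true
console.log(containsLoneSurrogate("f\uD834"));         // true
console.log(containsLoneSurrogate("f\uD834\uDF06"));   // false
```

Neither regular expression uses the `u` flag, so both operate on individual UTF-16 code units, like the original.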

Working out precisely which code points are contained in your string is a little harder and requires a UTF-16 decoder. The following will convert a string into an array of Unicode code points:

var getStringCodePoints = (function() {
    function surrogatePairToCodePoint(charCode1, charCode2) {
        return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
    }

    // Read string in character by character and create an array of code points
    return function(str) {
        var codePoints = [], i = 0, charCode;
        while (i < str.length) {
            charCode = str.charCodeAt(i);
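            // A pair is a high surrogate (U+D800 - U+DBFF) followed by a low surrogate (U+DC00 - U+DFFF);
            // an unpaired surrogate falls through and is pushed unchanged
            if ((charCode & 0xFC00) == 0xD800 && (str.charCodeAt(i + 1) & 0xFC00) == 0xDC00) {
                codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));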
            } else {
                codePoints.push(charCode);
            }
            ++i;
        }
        return codePoints;
    }
})();

alert( getStringCodePoints("f𝌆").join(",") ); // 102,119558
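In engines that support ES2015, the same decoding can probably be left to the built-in string iterator, which walks a string by code point. The sketch below assumes such an engine; `toCodePoints` and `codePointToSurrogatePair` are hypothetical names, and the second function simply reverses the arithmetic used in `surrogatePairToCodePoint`.

```javascript
// Hypothetical helper: decode a string into code points using built-in iteration.
function toCodePoints(str) {
    return Array.from(str, function (ch) { return ch.codePointAt(0); });
}

// Hypothetical helper: split a code point above U+FFFF into its surrogate pair.
function codePointToSurrogatePair(codePoint) {
    const offset = codePoint - 0x10000;
    // The top 10 bits go into the high surrogate, the bottom 10 into the low one.
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

console.log(toCodePoints("f\uD834\uDF06").join(","));  // 102,119558

const pair = codePointToSurrogatePair(0x1D306);
console.log(pair.map(n => n.toString(16)).join(","));  // d834,df06

// String.fromCodePoint rebuilds the original string from the code points.
console.log(String.fromCodePoint(102, 119558) === "f\uD834\uDF06");  // true
```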

Using more modern Javascript syntax (Chrome 46+):

const isMultiByte = string =>
  [...string].some(c => c.codePointAt(0) > 255)

Examples:

isMultiByte("hi") -> false
isMultiByte("hiÿ") -> false // char code 255, small letter y with diaeresis
isMultiByte("こ") -> true

To find the multi-byte characters, change .some to .filter:

const getMultiByteChars = string =>
  [...string].filter(c => c.codePointAt(0) > 255)

Example:

getMultiByteChars("こydwdこ") -> ['こ', 'こ']

If you want to eliminate duplicates:

const getUniqueMultiByteChars = string =>
  [...string]
    .filter(c => c.codePointAt(0) > 255)
    .reduce((uniq, c) => (
      uniq.includes(c) ? uniq : [...uniq, c]
    ), [])
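As an alternative sketch, a `Set` can do the de-duplication, since it keeps only the first occurrence of each value and preserves insertion order. The name `getUniqueMultiByteCharsViaSet` is hypothetical; the threshold of 255 is taken from the functions above.

```javascript
// Hypothetical variant: filter as before, then let a Set drop repeated characters.
const getUniqueMultiByteCharsViaSet = string =>
  [...new Set([...string].filter(c => c.codePointAt(0) > 255))]

console.log(getUniqueMultiByteCharsViaSet("こydwdこ"))  // [ 'こ' ]
```

For long strings this avoids the repeated `includes` scan and array copy in the `reduce` version.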

For positions of multi-byte characters:

const getMultiByteCharsPos = string =>
  [...string].reduce((idxs, c, idx) => (
    c.codePointAt(0) > 255 ? [...idxs, idx] : idxs
  ), [])

Example:

getMultiByteCharsPos("こydwdこ") -> [0, 5]
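Note that `getMultiByteCharsPos` counts positions in code points, because `[...string]` splits by code point. Methods such as `charAt`, `slice` and `substring` use UTF-16 code unit offsets instead, and the two diverge once the string contains a surrogate pair. A sketch of a variant that reports UTF-16 offsets, under the hypothetical name `getMultiByteCharsUtf16Pos`:

```javascript
// Hypothetical variant: same test as getMultiByteCharsPos, but the reported
// positions are UTF-16 code unit offsets usable with charAt/slice.
const getMultiByteCharsUtf16Pos = string => {
  const offsets = []
  let offset = 0
  for (const c of string) {
    if (c.codePointAt(0) > 255) offsets.push(offset)
    offset += c.length  // 1 for BMP characters, 2 for surrogate pairs
  }
  return offsets
}

// U+1D306 occupies offsets 0 and 1, "a" is at 2, "こ" is at 3.
console.log(getMultiByteCharsUtf16Pos("\uD834\uDF06aこ"))  // [ 0, 3 ]
```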

Note: This doesn't work in IE, which has no String.prototype.codePointAt(n). MS has officially EOL'ed Internet Explorer at the time of posting.
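If an older engine still has to be supported, a fallback along these lines should give the same answer as `isMultiByte`, since every code point above 255 is stored as one or two code units that are themselves above 255. The name `isMultiByteLegacy` is hypothetical, and the sketch assumes only `charCodeAt` is available.

```javascript
// Hypothetical ES5-style fallback: scan UTF-16 code units with charCodeAt.
function isMultiByteLegacy(str) {
    for (var i = 0; i < str.length; i++) {
        if (str.charCodeAt(i) > 255) {
            return true;
        }
    }
    return false;
}

console.log(isMultiByteLegacy("hi"));            // false
console.log(isMultiByteLegacy("hi\u00FF"));      // false (char code 255)
console.log(isMultiByteLegacy("\u3053"));        // true
console.log(isMultiByteLegacy("\uD834\uDF06"));  // true
```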
