JavaScript and string manipulation with UTF-16 surrogate pairs
I'm working on a Twitter app and just stumbled into the world of UTF-8 and UTF-16. It seems the majority of JavaScript string functions are as blind to surrogate pairs as I was. I've got to recode some stuff to make it wide-character aware.
I've got this function to parse strings into arrays while preserving the surrogate pairs. Then I'll recode several functions to deal with the arrays rather than strings.
function sortSurrogates(str){
    var cp = [];                       // array to hold code points
    while(str.length){                 // loop till we've done the whole string
        if(/^[\uD800-\uDBFF][\uDC00-\uDFFF]/.test(str)){ // high surrogate followed by a low surrogate
            cp.push(str.substr(0,2));  // push the pair onto array
            str = str.substr(2);       // clip the two off the string
        }else{                         // else BMP code point (or an unpaired surrogate)
            cp.push(str.substr(0,1));  // push one onto array
            str = str.substr(1);       // clip one from string
        }
    }                                  // loop
    return cp;                         // return the array
}
My question is, is there something simpler I'm missing? I see so many people reiterating that JavaScript deals with UTF-16 natively, yet my testing leads me to believe that, while that may be the data format, the functions don't know it yet. Am I missing something simple?
EDIT: To help illustrate the issue:
var a = "0123456789"; // U+0030 - U+0039 2 bytes each
var b = "𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"; // U+1D7D8 - U+1D7E1 4 bytes each
alert(a.length); // javascript shows 10
alert(b.length); // javascript shows 20
Twitter sees and counts both of those as being 10 characters long.
Javascript uses UCS-2 internally, which is not UTF-16. It is very difficult to handle Unicode in Javascript because of this, and I do not suggest attempting to do so.
As for what Twitter does, you seem to be saying that it is sanely counting by code point, not insanely by code unit.
Unless you have no choice, you should use a programming language that actually supports Unicode, and which has a code-point interface, not a code-unit interface. Javascript isn't good enough for that, as you have discovered.
It has The UCS-2 Curse, which is even worse than The UTF-16 Curse, which is already bad enough. I talk about all this in my OSCON talk, 🔫 Unicode Support Shootout: 👍 The Good, the Bad, & the (mostly) Ugly 👎.
Due to its horrible Curse, you have to hand-simulate UTF-16 with UCS-2 in Javascript, which is simply nuts.
Javascript suffers from all kinds of other terrible Unicode troubles, too. It has no support for graphemes or normalization or collation, all of which you really need. And its regexes are broken, sometimes due to the Curse, sometimes just because people got it wrong. For example, Javascript is incapable of expressing regexes like [𝒜-𝒵]. Javascript doesn't even support casefolding, so you can't write a pattern like /ΣΤΙΓΜΑΣ/i and have it correctly match στιγμας.
You can try to use the XRegExp plugin, but you won't banish the Curse that way. Only changing to a language with Unicode support will do that, and 𝒥𝒶𝓋𝒶𝓈𝒸𝓇𝒾𝓅𝓉 just isn't one of those.
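For readers on newer engines: ECMAScript 2015 and later added a u flag for regular expressions and a String.prototype.normalize method, which address some (not all) of the complaints above. The following is only a minimal sketch, assuming an ES2015+ runtime; the variable names are illustrative and not from the answer.

```javascript
// Assumes an ES2015+ engine; all variable names here are illustrative.

// With the u flag, a character class can range over code points above U+FFFF.
var scriptCapital = /^[𝒜-𝒵]$/u;
// "\u{1D49E}" is MATHEMATICAL SCRIPT CAPITAL C, which lies inside that range.
var matchesScriptC = scriptCapital.test("\u{1D49E}");

// Without the u flag, "." matches a single 16-bit code unit,
// so a string holding one astral character is not matched by /^.$/.
var astral = "\u{1D7D8}";
var dotWithoutU = /^.$/.test(astral);
var dotWithU = /^.$/u.test(astral);

// normalize() converts between decomposed and composed forms:
// "e" + COMBINING ACUTE ACCENT becomes the single code point U+00E9 under NFC.
var composed = "e\u0301".normalize("NFC");
```

Grapheme segmentation and collation are not covered by this sketch.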
I've knocked together the starting point for a Unicode string handling object. It creates a function called UnicodeString() that accepts either a JavaScript string or an array of integers representing Unicode code points and provides length and codePoints properties and toString() and slice() methods. Adding regular expression support would be very complicated, but things like indexOf() and split() (without regex support) should be pretty easy to implement.
var UnicodeString = (function() {
    function surrogatePairToCodePoint(charCode1, charCode2) {
        return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
    }

    function stringToCodePointArray(str) {
        var codePoints = [], i = 0, charCode;
        while (i < str.length) {
            charCode = str.charCodeAt(i);
            if ((charCode & 0xF800) == 0xD800) {
                codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));
            } else {
                codePoints.push(charCode);
            }
            ++i;
        }
        return codePoints;
    }

    function codePointArrayToString(codePoints) {
        var stringParts = [];
        for (var i = 0, len = codePoints.length, codePoint, offset, codePointCharCodes; i < len; ++i) {
            codePoint = codePoints[i];
            if (codePoint > 0xFFFF) {
                offset = codePoint - 0x10000;
                codePointCharCodes = [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
            } else {
                codePointCharCodes = [codePoint];
            }
            stringParts.push(String.fromCharCode.apply(String, codePointCharCodes));
        }
        return stringParts.join("");
    }

    function UnicodeString(arg) {
        if (this instanceof UnicodeString) {
            this.codePoints = (typeof arg == "string") ? stringToCodePointArray(arg) : arg;
            this.length = this.codePoints.length;
        } else {
            return new UnicodeString(arg);
        }
    }

    UnicodeString.prototype = {
        slice: function(start, end) {
            return new UnicodeString(this.codePoints.slice(start, end));
        },
        toString: function() {
            return codePointArrayToString(this.codePoints);
        }
    };

    return UnicodeString;
})();

var ustr = UnicodeString("f𝌆𝌆bar");
document.getElementById("output").textContent = "String: '" + ustr + "', length: " + ustr.length + ", slice(2, 4): " + ustr.slice(2, 4);
<div id="output"></div>
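As a rough illustration of why a non-regex indexOf() is easy once a string has been turned into a code point array, here is a standalone sketch. The helpers toCodePoints and codePointIndexOf are hypothetical names and are not part of the UnicodeString object above. The sketch also assumes an ES2015+ engine for Array.from and codePointAt.

```javascript
// Hypothetical helpers for illustration; not part of UnicodeString.
// Assumes ES2015+ (Array.from iterates a string by code point).
function toCodePoints(str) {
    return Array.from(str, function (ch) { return ch.codePointAt(0); });
}

// Returns the index, counted in code points, of the first occurrence
// of needle in haystack at or after fromIndex, or -1 if there is none.
function codePointIndexOf(haystack, needle, fromIndex) {
    var h = toCodePoints(haystack);
    var n = toCodePoints(needle);
    var start = Math.max(fromIndex || 0, 0);
    for (var i = start; i + n.length <= h.length; i++) {
        var j = 0;
        while (j < n.length && h[i + j] === n[j]) {
            j++;
        }
        if (j === n.length) {
            return i;   // every code point of needle matched at position i
        }
    }
    return -1;
}

// "f𝌆𝌆bar" written with escapes: U+1D306 appears twice.
var sample = "f\uD834\uDF06\uD834\uDF06bar";
var nativeIndex = sample.indexOf("bar");               // counts code units
var codePointIndex = codePointIndexOf(sample, "bar");  // counts code points
```

The native method reports the position in 16-bit code units, while the sketch reports it in code points, which is the number a wrapper like UnicodeString would want to expose.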
Here are a couple of scripts that might be helpful when dealing with surrogate pairs in JavaScript:
ES6 Unicode shims for ES3+ adds the String.fromCodePoint and String.prototype.codePointAt methods from ECMAScript 6. The ES3/5 fromCharCode and charCodeAt methods do not account for surrogate pairs and therefore give wrong results for code points above U+FFFF.
Full 21-bit Unicode code point matching in XRegExp with \u{10FFFF} allows matching any individual code point in XRegExp regexes.
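To make the difference between the code-unit and code-point methods concrete, here is a small sketch that assumes an engine (or shim) providing the ECMAScript 6 methods named above. The variable names are illustrative only.

```javascript
// U+1D7D8 MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO, written as its surrogate pair.
var digit = "\uD835\uDFD8";

// charCodeAt sees only the first 16-bit code unit (the high surrogate).
var firstUnit = digit.charCodeAt(0);

// codePointAt combines the pair into the full code point.
var codePoint = digit.codePointAt(0);

// fromCodePoint produces the two-unit string for a code point above U+FFFF.
var roundTrip = String.fromCodePoint(0x1D7D8);

// fromCharCode reduces its argument to 16 bits, so 0x1D7D8 becomes 0xD7D8.
var truncated = String.fromCharCode(0x1D7D8);
```

Here firstUnit is 0xD835 while codePoint is 0x1D7D8, and truncated is the single unrelated character U+D7D8 instead of the intended digit.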
Javascript string iterators can give you the actual characters (whole code points) instead of the individual surrogate code units:
>>> [..."0123456789"]
["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
>>> [..."𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"]
["𝟘", "𝟙", "𝟚", "𝟛", "𝟜", "𝟝", "𝟞", "𝟟", "𝟠", "𝟡"]
>>> [..."0123456789"].length
10
>>> [..."𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"].length
10
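Building on the iterator behaviour shown above, a code-point-aware length and substr can be sketched in a few lines. The function names cpLength and cpSubstr are made up for this example, and the sketch assumes an ES2015+ engine where Array.from uses the string iterator.

```javascript
// Hypothetical helpers; assumes ES2015+ so that Array.from(str)
// splits the string by code point and keeps surrogate pairs together.
function cpLength(str) {
    return Array.from(str).length;
}

// Like String.prototype.substr, but start and len count code points.
function cpSubstr(str, start, len) {
    var end = (len === undefined) ? undefined : start + len;
    return Array.from(str).slice(start, end).join("");
}

// Build the ten double-struck digits U+1D7D8 through U+1D7E1.
var doubleStruck = "";
for (var cp = 0x1D7D8; cp <= 0x1D7E1; cp++) {
    doubleStruck += String.fromCodePoint(cp);
}

var unitCount = doubleStruck.length;          // counts 16-bit code units
var pointCount = cpLength(doubleStruck);      // counts code points
var middle = cpSubstr(doubleStruck, 4, 5);    // five digits starting at the fifth
```

This counts code points, not user-perceived characters: a base letter followed by a combining mark still counts as two, so it is not a full grapheme-aware solution.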
This is along the lines of what I was looking for. It needs better support for the different string functions. As I add to it I will update this answer.
function wString(str){
    var T = this;           // makes 'this' visible in functions
    T.cp = [];              // code point array
    T.length = 0;           // length attribute
    T.wString = true;       // (item.wString) tests for wString object
    // member functions
    var sortSurrogates = function(s){ // returns array of code points, each held as a string
        var chrs = [];
        while(s.length){    // loop till we've done the whole string
            if(/^[\uD800-\uDBFF][\uDC00-\uDFFF]/.test(s)){ // high surrogate followed by a low surrogate
                chrs.push(s.substr(0,2)); // push the pair onto array
                s = s.substr(2);          // clip the two off the string
            }else{          // else BMP code point (or an unpaired surrogate)
                chrs.push(s.substr(0,1)); // push one onto array
                s = s.substr(1);          // clip one from string
            }
        }                   // loop
        return chrs;
    };
    // end member functions
    // prototype functions
    T.substr = function(start,len){
        if(len !== undefined){ // a length of 0 must yield an empty string
            return T.cp.slice(start,start+len).join('');
        }else{
            return T.cp.slice(start).join('');
        }
    };
    T.substring = function(start,end){
        return T.cp.slice(start,end).join('');
    };
    T.replace = function(target,str){
        // allow wStrings as parameters
        if(str.wString) str = str.cp.join('');
        if(target.wString) target = target.cp.join('');
        return T.toString().replace(target,str);
    };
    T.equals = function(s){ // assigns a new value (plain string or wString)
        if(!s.wString){
            s = sortSurrogates(s);
            T.cp = s;
        }else{
            T.cp = s.cp;
        }
        T.length = T.cp.length;
    };
    T.toString = function(){return T.cp.join('');};
    // end prototype functions
    T.equals(str);
}
Test results:
// plain string
var x = "0123456789";
alert(x); // 0123456789
alert(x.substr(4,5)) // 45678
alert(x.substring(2,4)) // 23
alert(x.replace("456","x")); // 0123x789
alert(x.length); // 10
// wString object
x = new wString("𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡");
alert(x); // 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡
alert(x.substr(4,5)) // 𝟜𝟝𝟞𝟟𝟠
alert(x.substring(2,4)) // 𝟚𝟛
alert(x.replace("𝟜𝟝𝟞","x")); // 𝟘𝟙𝟚𝟛x𝟟𝟠𝟡
alert(x.length); // 10