
JavaScript and string manipulation w/ UTF-16 surrogate pairs

I'm working on a Twitter app and just stumbled into the world of UTF-8(16). It seems the majority of JavaScript string functions are as blind to surrogate pairs as I was. I've got to recode some stuff to make it wide-character aware.

I've got this function to parse strings into arrays while preserving the surrogate pairs. Then I'll recode several functions to deal with the arrays rather than strings.

function sortSurrogates(str){
  var cp = [];                 // array to hold code points
  while(str.length){           // loop till we've done the whole string
    if(/^[\uD800-\uDBFF][\uDC00-\uDFFF]/.test(str)){ // test for a leading surrogate pair
                               // High surrogate found and a low surrogate follows
      cp.push(str.substr(0,2)); // push the two onto array
      str = str.substr(2);     // clip the two off the string
    }else{                     // else BMP code point
      cp.push(str.substr(0,1)); // push one onto array
      str = str.substr(1);     // clip one from string 
    }
  }                            // loop
  return cp;                   // return the array
}

My question is, is there something simpler I'm missing? I see so many people reiterating that JavaScript deals with UTF-16 natively, yet my testing leads me to believe that, while that may be the data format, the functions don't know it yet. Am I missing something simple?

EDIT: To help illustrate the issue:

var a = "0123456789"; // U+0030 - U+0039 2 bytes each
var b = "𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"; // U+1D7D8 - U+1D7E1 4 bytes each
alert(a.length); // javascript shows 10
alert(b.length); // javascript shows 20

Twitter sees and counts both of those as being 10 characters long.
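
To make the gap between the two counts concrete, here is a small sketch of how one code point above U+FFFF becomes two 16-bit code units. The helper name `toSurrogatePair` is hypothetical and introduced only for illustration; the arithmetic is the standard UTF-16 encoding. Escapes are used instead of the literal character so the snippet stays ASCII.

```javascript
// Hypothetical helper (not from the question): split a supplementary
// code point (U+10000..U+10FFFF) into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;     // 20-bit value
  var high = 0xD800 + (offset >> 10);   // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

var pair = toSurrogatePair(0x1D7D8);    // U+1D7D8, the first character of b
var asString = String.fromCharCode(pair[0], pair[1]);
var pairHex = pair.map(function (unit) { return unit.toString(16); }).join(" ");

console.log(pairHex);          // d835 dfd8
console.log(asString.length);  // 2
```

Each of the ten characters in `b` occupies two code units in this way, which is why `b.length` reports 20.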

JavaScript uses UCS-2 internally, which is not UTF-16. It is very difficult to handle Unicode in JavaScript because of this, and I do not suggest attempting to do so.

As for what Twitter does, you seem to be saying that it is sanely counting by code point, not insanely by code unit.

Unless you have no choice, you should use a programming language that actually supports Unicode, and which has a code-point interface, not a code-unit interface. JavaScript isn't good enough for that, as you have discovered.

It has The UCS-2 Curse, which is even worse than The UTF-16 Curse, which is already bad enough. I talk about all this in the OSCON talk 🔫 Unicode Support Shootout: 👍 The Good, the Bad, & the (mostly) Ugly 👎.

Due to its horrible Curse, you have to hand-simulate UTF-16 with UCS-2 in JavaScript, which is simply nuts.

JavaScript suffers from all kinds of other terrible Unicode troubles, too. It has no support for graphemes or normalization or collation, all of which you really need. And its regexes are broken, sometimes due to the Curse, sometimes just because people got it wrong. For example, JavaScript is incapable of expressing regexes like [𝒜-𝒵]. JavaScript doesn't even support casefolding, so you can't write a pattern like /ΣΤΙΓΜΑΣ/i and have it correctly match στιγμας.

You can try to use the XRegExp plugin, but you won't banish the Curse that way. Only changing to a language with Unicode support will do that, and 𝒥𝒶𝓋𝒶𝓈𝒸𝓇𝒾𝓅𝓉 just isn't one of those.
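
The [𝒜-𝒵] case can be checked directly. The following is a minimal sketch, assuming an engine that implements the ES2015 `u` regex flag. Without the flag the class is read as 16-bit code units and the range is out of order, so the pattern does not compile; with the flag the range is read by code point. It says nothing about the case-folding complaint above. The variable names are illustrative only.

```javascript
// Assumes an ES2015+ engine. Escapes keep the source ASCII.
var scriptA = "\uD835\uDC9C"; // U+1D49C
var scriptZ = "\uD835\uDCB5"; // U+1D4B5
var scriptC = "\uD835\uDC9E"; // U+1D49E

// Without the u flag the class is parsed as four code units, and the
// middle range \uDC9C-\uD835 is out of order.
var withoutFlag;
try {
  new RegExp("[" + scriptA + "-" + scriptZ + "]");
  withoutFlag = "compiled";
} catch (e) {
  withoutFlag = e.name;
}

// With the u flag the same source is parsed as one code point range.
var byCodePoint = new RegExp("[" + scriptA + "-" + scriptZ + "]", "u");

console.log(withoutFlag);               // SyntaxError
console.log(byCodePoint.test(scriptC)); // true
console.log(byCodePoint.test("C"));     // false
```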

I've knocked together the starting point for a Unicode string handling object. It creates a function called UnicodeString() that accepts either a JavaScript string or an array of integers representing Unicode code points, and provides length and codePoints properties and toString() and slice() methods. Adding regular expression support would be very complicated, but things like indexOf() and split() (without regex support) should be pretty easy to implement.

var UnicodeString = (function() {
  function surrogatePairToCodePoint(charCode1, charCode2) {
    return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
  }

  function stringToCodePointArray(str) {
    var codePoints = [], i = 0, charCode;
    while (i < str.length) {
      charCode = str.charCodeAt(i);
      if ((charCode & 0xF800) == 0xD800) {
        codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));
      } else {
        codePoints.push(charCode);
      }
      ++i;
    }
    return codePoints;
  }

  function codePointArrayToString(codePoints) {
    var stringParts = [];
    for (var i = 0, len = codePoints.length, codePoint, offset, codePointCharCodes; i < len; ++i) {
      codePoint = codePoints[i];
      if (codePoint > 0xFFFF) {
        offset = codePoint - 0x10000;
        codePointCharCodes = [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
      } else {
        codePointCharCodes = [codePoint];
      }
      stringParts.push(String.fromCharCode.apply(String, codePointCharCodes));
    }
    return stringParts.join("");
  }

  function UnicodeString(arg) {
    if (this instanceof UnicodeString) {
      this.codePoints = (typeof arg == "string") ? stringToCodePointArray(arg) : arg;
      this.length = this.codePoints.length;
    } else {
      return new UnicodeString(arg);
    }
  }

  UnicodeString.prototype = {
    slice: function(start, end) {
      return new UnicodeString(this.codePoints.slice(start, end));
    },
    toString: function() {
      return codePointArrayToString(this.codePoints);
    }
  };

  return UnicodeString;
})();

var ustr = UnicodeString("f𝌆𝌆bar");
document.getElementById("output").textContent = "String: '" + ustr + "', length: " + ustr.length + ", slice(2, 4): " + ustr.slice(2, 4);
 <div id="output"></div> 
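
As a rough sketch of how the indexOf() mentioned above might look, here is a standalone version that compares arrays of code points. `toCodePoints` and `codePointIndexOf` are hypothetical names and are not part of the UnicodeString object; the decoder is rewritten here only so the snippet runs on its own, and unlike the one above it pairs a high surrogate only when a low surrogate actually follows.

```javascript
// Hypothetical helpers, not part of the UnicodeString object above.
function toCodePoints(str) {
  var out = [];
  for (var i = 0; i < str.length; ++i) {
    var hi = str.charCodeAt(i);
    var lo = str.charCodeAt(i + 1); // NaN past the end, so the test below fails
    if (hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF) {
      out.push(((hi & 0x3FF) << 10) + (lo & 0x3FF) + 0x10000);
      ++i; // skip the low surrogate just consumed
    } else {
      out.push(hi);
    }
  }
  return out;
}

// Index, counted in code points, of the first occurrence of needle, or -1.
function codePointIndexOf(haystack, needle, fromIndex) {
  var h = toCodePoints(haystack), n = toCodePoints(needle);
  for (var i = fromIndex || 0; i <= h.length - n.length; ++i) {
    var j = 0;
    while (j < n.length && h[i + j] === n[j]) ++j;
    if (j === n.length) return i;
  }
  return -1;
}

var sample = "f\uD834\uDF06\uD834\uDF06bar"; // "f𝌆𝌆bar"
console.log(codePointIndexOf(sample, "bar")); // 3
console.log(sample.indexOf("bar"));           // 5 (counted in code units)
```

A split() without regex support could be built the same way, by calling this search repeatedly and slicing the code point array between matches.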

Here are a couple of scripts that might be helpful when dealing with surrogate pairs in JavaScript:

JavaScript string iterators can give you the actual characters instead of the surrogate code units:

>>> [..."0123456789"]
["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
>>> [..."𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"]
["𝟘", "𝟙", "𝟚", "𝟛", "𝟜", "𝟝", "𝟞", "𝟟", "𝟠", "𝟡"]
>>> [..."0123456789"].length
10
>>> [..."𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"].length
10
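
Building on the iterator, substring-style operations can be written by spreading into an array, slicing, and joining. This is a minimal sketch, assuming an ES2015 engine; `cpLength` and `cpSlice` are hypothetical names. It counts code points, not user-perceived characters, so a base letter followed by a combining mark still counts as two.

```javascript
// Hypothetical helpers built on the string iterator (ES2015+).
function cpLength(str) {
  return [...str].length;
}

function cpSlice(str, start, end) {
  return [...str].slice(start, end).join("");
}

var digits = "\uD835\uDFD8\uD835\uDFD9\uD835\uDFDA\uD835\uDFDB"; // "𝟘𝟙𝟚𝟛"
var middle = cpSlice(digits, 1, 3);                               // "𝟙𝟚"

console.log(digits.length);     // 8 code units
console.log(cpLength(digits));  // 4 code points
console.log(cpLength(middle));  // 2
```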

This is along the lines of what I was looking for. It needs better support for the different string functions. As I add to it, I will update this answer.

function wString(str){
  var T = this; //makes 'this' visible in functions
  T.cp = [];    //code point array
  T.length = 0; //length attribute
  T.wString = true; // (item.wString) tests for wString object

//member functions
  var sortSurrogates = function(s){  // returns array of code points as strings; var keeps it out of the global scope
    var chrs = [];
    while(s.length){             // loop till we've done the whole string
      if(/^[\uD800-\uDBFF][\uDC00-\uDFFF]/.test(s)){ // test for a leading surrogate pair
                                 // High surrogate found and a low surrogate follows
        chrs.push(s.substr(0,2)); // push the two onto array
        s = s.substr(2);         // clip the two off the string
      }else{                     // else BMP code point
        chrs.push(s.substr(0,1)); // push one onto array
        s = s.substr(1);         // clip one from string 
      }
    }                            // loop
    return chrs;
  };
//end member functions

//prototype functions
  T.substr = function(start,len){
    if(len !== undefined){       // a length of 0 is valid and yields ''
      return T.cp.slice(start,start+len).join('');
    }else{
      return T.cp.slice(start).join('');
    }
  };

  T.substring = function(start,end){
    return T.cp.slice(start,end).join('');
  };

  T.replace = function(target,str){
    //allow wStrings as parameters
    if(str.wString) str = str.cp.join('');
    if(target.wString) target = target.cp.join('');
    return T.toString().replace(target,str);
  };

  T.equals = function(s){
    if(!s.wString){
      s = sortSurrogates(s);
      T.cp = s;
    }else{
        T.cp = s.cp;
    }
    T.length = T.cp.length;
  };

  T.toString = function(){return T.cp.join('');};
//end prototype functions

  T.equals(str);
}

Test results:

// plain string
var x = "0123456789";
alert(x);                    // 0123456789
alert(x.substr(4,5))         // 45678
alert(x.substring(2,4))      // 23
alert(x.replace("456","x")); // 0123x789
alert(x.length);             // 10

// wString object
x = new wString("𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡");
alert(x);                    // 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡
alert(x.substr(4,5))         // 𝟜𝟝𝟞𝟟𝟠
alert(x.substring(2,4))      // 𝟚𝟛
alert(x.replace("𝟜𝟝𝟞","x")); // 𝟘𝟙𝟚𝟛x𝟟𝟠𝟡
alert(x.length);             // 10

