
How to split a Unicode string into characters in JavaScript

For a long time we used a naive approach to split strings in JS:

someString.split('');

But the popularity of emoji forced us to change this approach: emoji characters (and other non-BMP characters) are made of two "characters" (UTF-16 code units).

String.fromCodePoint(128514).split(''); // array of 2 characters; can't embed due to StackOverflow limitations
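To see why this happens, it can help to look at the UTF-16 code units directly. The following is only a minimal sketch; the variable names are illustrative and not part of the question.

```javascript
// U+1F602 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
const laugh = String.fromCodePoint(128514);

// length counts UTF-16 code units, not user-visible characters.
const unitCount = laugh.length; // 2

// split('') cuts between the two surrogates, leaving two lone halves.
const halves = laugh.split('');
const halfCodes = halves.map((h) => h.charCodeAt(0).toString(16)); // ['d83d', 'de02']

console.log(unitCount, halfCodes);
```

Neither half is a valid character on its own, which is why the split result cannot be rendered.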

So what is the modern, correct and performant approach to this task?

Using spread in an array literal:

    const str = "🌍🤖😸🎉";
    console.log([...str]);

Using for...of:

    function split(str) {
        const arr = [];
        for (const char of str) arr.push(char);
        return arr;
    }

    const str = "🌍🤖😸🎉";
    console.log(split(str));

The best approach to this task is to use the native String.prototype[Symbol.iterator], which iterates over Unicode code points rather than UTF-16 code units. Consequently, a clean and easy way to split a string into Unicode characters is to call Array.from on it, e.g.:

const string = String.fromCodePoint(128514, 32, 105, 32, 102, 101, 101, 108, 32, 128514, 32, 97, 109, 97, 122, 105, 110, 128514);
Array.from(string);
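As a small illustration of the difference between code units and code points, here is a hedged sketch with a shorter sample string (the names `text`, `chars` and `codePoints` are made up for this example). It also uses the optional mapping function that Array.from accepts as its second argument.

```javascript
const text = 'a\u{1F602}b'; // "a", U+1F602, "b"

// Array.from uses the string iterator, so it walks the string by code point.
const chars = Array.from(text);

// The second argument maps each element as it is produced.
const codePoints = Array.from(text, (ch) => ch.codePointAt(0));

console.log(text.length);  // 4 (UTF-16 code units)
console.log(chars.length); // 3 (code points)
console.log(codePoints);   // [97, 128514, 98]
```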

The u flag was introduced in ECMAScript 2015 to make regular expressions Unicode-aware.

Adding u to your regex returns the complete character in your result.

    const withFlag = `AB😂DE`.match(/./ug);
    const withoutFlag = `AB😂DE`.match(/./g);
    console.log(withFlag, withoutFlag);
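Two caveats apply to the regex route: `.` does not match line terminators, and `match` returns `null` when nothing matches (for example on an empty string). A hedged sketch of a helper that works around both follows; the name `splitByRegex` is hypothetical.

```javascript
function splitByRegex(str) {
  // With the u flag, [\s\S] matches any single code point, line breaks included.
  // match() returns null when there are no matches, so fall back to an empty array.
  return str.match(/[\s\S]/gu) || [];
}

const withNewline = 'A\n\u{1F602}'; // "A", line feed, U+1F602

console.log(splitByRegex(withNewline).length); // 3
console.log(withNewline.match(/./gu).length);  // 2, the line feed is skipped
console.log(splitByRegex(''));                 // []
```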


I did something like this when I had to support older browsers and an ES5 minifier; it will probably be useful to others:

    if (Array.from && window.Symbol && window.Symbol.iterator) {
        array = Array.from(input[window.Symbol.iterator]());
    } else {
        array = ...; // maybe `input.split('');` as fallback if it doesn't matter
    }
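If the fallback does matter, one ES5-compatible option is a regex that tries to match a surrogate pair first and falls back to any single code unit. This is not from the answer above; it is a sketch under that assumption, and the function name `splitCodePointsES5` is hypothetical.

```javascript
function splitCodePointsES5(input) {
  // First alternative: a high surrogate followed by a low surrogate.
  // Second alternative: any single UTF-16 code unit (lone surrogates included).
  var pairs = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S]/g;
  // match() returns null for an empty string, so fall back to an empty array.
  return input.match(pairs) || [];
}

var sample = 'a' + '\uD83D\uDE02' + 'b'; // "a", U+1F602, "b"
console.log(splitCodePointsES5(sample).length); // 3
```

Like the string iterator, this splits by code point only, so it still breaks up grapheme clusters made of several code points.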

JavaScript has a newer API called Intl.Segmenter (part of the ECMAScript Internationalization API, ECMA-402) that allows you to split strings based on graphemes (the user-perceived characters of a string). With this API, your split might look like so:

    const split = (str) => {
        const itr = new Intl.Segmenter("en", {granularity: 'grapheme'}).segment(str);
        return Array.from(itr, ({segment}) => segment);
    };

    console.log(split('é')); // ['é']
    console.log(split('❤️')); // ['❤️']
    console.log(split('♀️')); // ['♀️']

This lets you handle not only emojis made of two UTF-16 code units, but also other characters such as composite characters, characters joined by ZWJs, characters with variation selectors (e.g. ❤️), characters with emoji modifiers, etc. These latter cases cannot be handled by invoking the string iterator (via spread ..., for...of, Symbol.iterator, etc.) as seen in the other answers, because the iterator only walks the code points of your string.
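To make the difference concrete, here is a hedged sketch that compares code point counts with grapheme counts. The sample strings are my own and are written with escape sequences so the inputs are unambiguous; the names `segmenter`, `graphemes` and `samples` are hypothetical.

```javascript
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = (str) => Array.from(segmenter.segment(str), (s) => s.segment);

const samples = {
  combining: 'e\u0301',                                 // "e" + combining acute accent
  zwjFamily: '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}', // man, ZWJ, woman, ZWJ, girl
  variation: '\u2764\uFE0F',                            // heavy black heart + variation selector-16
  skinTone: '\u{1F44B}\u{1F3FD}',                       // waving hand + skin tone modifier
};

for (const [name, str] of Object.entries(samples)) {
  // The spread counts code points; the segmenter counts user-perceived characters.
  console.log(name, 'code points:', [...str].length, 'graphemes:', graphemes(str).length);
}
```

Each sample should report more than one code point but a single grapheme, assuming the runtime ships Intl.Segmenter with up-to-date Unicode data.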
