
How to iterate over all Unicode characters?

Is it possible to iterate over all Unicode characters (UTF-8)? Thanks! I've tried using:

character = String.fromCharCode(i);

But I'm not sure how to implement it.

UTF-8 is an encoding, not a character set. JavaScript strings are (mostly) encoded in UTF-16, and the encoding only matters if you're working in an environment that doesn't support ES6's String.fromCodePoint. Getting a string from a code point with ES6:

var s = String.fromCodePoint(codePoint);

and without ES6, using a UTF-16 surrogate pair for characters U+10000 and onwards:

var s;

if (codePoint < 0x10000) {
    s = String.fromCharCode(codePoint);
} else {
    var offset = codePoint - 0x10000;
    s = String.fromCharCode(0xd800 + (offset >> 10),
                            0xdc00 + (offset & 0x3ff));
}
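
To make the surrogate-pair arithmetic concrete, here is a minimal sketch that wraps the fallback above in a function and runs it on U+1F41B. The name toUtf16 is hypothetical (it is not a built-in), and the range check is an addition for illustration:

```javascript
// Hypothetical helper: encode one code point as UTF-16 code units by hand.
function toUtf16(codePoint) {
    if (codePoint < 0 || codePoint > 0x10ffff) {
        throw new RangeError('Invalid code point: ' + codePoint);
    }
    if (codePoint < 0x10000) {
        // Basic Multilingual Plane: a single code unit is enough.
        return String.fromCharCode(codePoint);
    }
    // Supplementary planes: split the 20-bit offset into two 10-bit halves.
    var offset = codePoint - 0x10000;
    return String.fromCharCode(0xd800 + (offset >> 10),
                               0xdc00 + (offset & 0x3ff));
}

var bug = toUtf16(0x1f41b);
console.log(bug.length); // 2
console.log(bug.charCodeAt(0).toString(16), bug.charCodeAt(1).toString(16)); // d83d dc1b
```

Where String.fromCodePoint is available, it should return the same string for any code point that is not itself a surrogate.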

Code points range from U+0000 to U+10FFFF (1 114 112 values), but not everything in that range is a valid Unicode character. You can get a table from http://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt and extract the characters you really want to iterate over.
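
As a rough alternative to parsing UnicodeData.txt, the sketch below walks the whole range, skips the surrogates (U+D800 to U+DFFF), and uses a Unicode property escape to drop unassigned code points. It assumes an engine with ES2018 support for \p{...} in regular expressions. The helper name forEachScalarValue is hypothetical, and the "assigned" count depends on the Unicode version built into the engine:

```javascript
// Hypothetical helper: call back for every code point except the surrogates.
function forEachScalarValue(callback) {
    for (var cp = 0; cp <= 0x10ffff; cp++) {
        if (cp >= 0xd800 && cp <= 0xdfff) continue; // surrogates are not characters
        callback(cp);
    }
}

var total = 0;
var assigned = 0;
var unassigned = /^\p{Cn}$/u; // General_Category = Unassigned (ES2018 assumed)

forEachScalarValue(function (cp) {
    total++;
    if (!unassigned.test(String.fromCodePoint(cp))) {
        assigned++;
    }
});

console.log(total);    // 1112064
console.log(assigned); // varies with the engine's Unicode version
```

The loop runs over a million iterations, so it takes a moment, but it only counts and never builds one huge string.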

According to the docs, the argument passed to String.fromCharCode(a) is converted with ToUint16, and the character for the resulting code unit is returned. You may call it with any number you want, but the value is not capped: it is reduced modulo 2^16, so it always ends up between 0 and 2^16 - 1 (65535).
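
A small sketch of what the ToUint16 conversion means in practice: only the low 16 bits of the argument survive, so values above 0xFFFF silently turn into a different character. The variable names are for illustration only:

```javascript
// ToUint16 keeps only the low 16 bits, so 0x10041 becomes 0x0041.
var wrapped = String.fromCharCode(0x10041);
console.log(wrapped); // A

// U+1F41B loses its high bits and becomes the single code unit 0xF41B.
var truncated = String.fromCharCode(0x1f41b);
console.log(truncated.length, truncated.charCodeAt(0).toString(16)); // 1 f41b

// String.fromCodePoint handles the full range and yields a surrogate pair.
var full = String.fromCodePoint(0x1f41b);
console.log(full.length); // 2
```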

var highNumber = 500; // This could go much higher
var out = "";
for (var i = 0; i < highNumber; i++) {
    out += String.fromCharCode(i);
}
console.log(out);

Warning: if you run this code with highNumber set to 2^16, you may freeze your tab or browser, because the output is far too large to display. This answer assumes you want to iterate over all characters, not over the characters in a given string, which is quite a different thing.

Sample output for a more reasonable highNumber (i.e. 500):

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
stuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæç
èéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺ
ĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƀƁƂƃƄƅƆƇƈƉƊƋƌƍ
ƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿǀǁǂǃǄǅǆǇǈǉǊǋǌǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟǠ
ǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯǰǱǲǳ

(Adding this answer because relevant for some Google searches)

The correct way to iterate character by character over a string that may contain characters outside the Basic Multilingual Plane (e.g. emoji, which take two UTF-16 code units each) is Array.from(), which splits the string by code point:

const bugs = '🐛🐛🐛'

// WRONG: splits each surrogate pair into two separate UTF-16 code units
bugs.split('')
// Array(6) [ "\ud83d", "\udc1b", "\ud83d", "\udc1b", "\ud83d", "\udc1b" ]

// CORRECT
Array.from(bugs)
// Array(3) [ "🐛", "🐛", "🐛" ]

Then iterate as you would over any normal array (suggested: map / forEach).
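
For comparison, a for...of loop uses the same string iterator as Array.from(), so it also steps by code point. The sketch below prints each code point in U+XXXX form. The sample strings and variable names are illustrative, and padStart assumes ES2017. Note that neither approach groups combining sequences: a letter followed by a combining accent still counts as two items.

```javascript
var text = 'a🐛b';
var seen = [];

// The string iterator yields whole code points, not UTF-16 code units.
for (var ch of text) {
    seen.push('U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}
console.log(seen.join(' ')); // U+0061 U+1F41B U+0062

// Caveat: 'e' + U+0301 (combining acute accent) renders as one glyph
// but is still two code points.
var accented = Array.from('e\u0301');
console.log(accented.length); // 2
```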

More information: https://medium.com/@giltayar/iterating-over-emoji-characters-the-es6-way-f06e4589516

I think this might define what to iterate over exactly:

[image: enter image description here]

A JavaScript string has a length property, measured in UTF-16 code units. You can iterate over those code units simply (characters above U+FFFF show up as two separate surrogates):

for(var i = 0; i < str.length; i++) {
    var char = str[i],
       code = str.charCodeAt(i);
}
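
If the string may contain characters above U+FFFF, an index-based loop can still work as long as it advances by two code units when it meets a surrogate pair. This is a minimal sketch using codePointAt (ES6); the function name codePointsOf is hypothetical:

```javascript
// Hypothetical helper: collect the code points of a string with an index loop.
function codePointsOf(str) {
    var result = [];
    for (var i = 0; i < str.length; ) {
        var cp = str.codePointAt(i);
        result.push(cp);
        // Code points above U+FFFF occupy two UTF-16 code units.
        i += cp > 0xffff ? 2 : 1;
    }
    return result;
}

console.log(codePointsOf('a🐛b').join(' ')); // 97 128027 98
```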
