JavaScript toLowerCase strange behaviour

I have a small application that reads tweets and tries to match keywords, and I noticed strange behaviour with one particular string:

var text = "The Νіk​е D​un​k​ Ніgh ЅΒ 'Uglу Ѕwеаt​еr​' іѕ n​оw аvаіlаblе http://swoo.sh/IHVaTL";
var lowercase = text.toLowerCase()

Now the value of lowercase is:

the νіk​е d​un​k​ ніgh ѕβ 'uglу ѕwеаt​еr​' іѕ n​оw аvаіlаblе http://swoo.sh/ihvatl

So it seems the string is in a weird format. I double-checked some of the letters and found that:

text.charAt(4)
>"N"
text.charCodeAt(4)
>925
'N'.charCodeAt(0)
>78

So even though it looks like a normal N, its code is 925 in decimal, which is 039D in hex, and according to the Unicode chart that corresponds to

039D Ν GREEK CAPITAL LETTER NU

(Looking up 925 as if it were already hex gives 0925 थ DEVANAGARI LETTER THA, which is a different character.)
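
One way to double-check a lookup like this: charCodeAt returns a decimal number, while Unicode charts are indexed in hexadecimal, so it helps to print the value as hex before searching the chart. The snippet below is only a sketch. toCodePointLabel and describeNonAscii are hypothetical helper names, and the sample string is built from escapes to stand in for look-alike letters of the kind that appear in the tweet.

```javascript
// Hypothetical helper: format a character's code point the way Unicode charts do (U+XXXX).
function toCodePointLabel(ch) {
    var hex = ch.codePointAt(0).toString(16).toUpperCase();
    while (hex.length < 4) {
        hex = '0' + hex;
    }
    return 'U+' + hex;
}

// Hypothetical helper: collect the label of every distinct non-ASCII character in a string.
function describeNonAscii(text) {
    var seen = {},
        report = [];

    Array.from(text).forEach(function (ch) {
        if (ch.codePointAt(0) > 127 && !seen[ch]) {
            seen[ch] = true;
            report.push(toCodePointLabel(ch));
        }
    });

    return report;
}

// Greek capital Nu, Cyrillic small i, Latin k, Cyrillic small ie: renders like "Nike".
var sample = "\u039D\u0456k\u0435";

console.log((925).toString(16));                  // 39d, i.e. U+039D, not U+0925
console.log(describeNonAscii(sample).join(' '));  // U+039D U+0456 U+0435
console.log("\u039D".toLowerCase() === "\u03BD"); // true: Greek Nu lowercases to Greek nu
```

Seen this way, toLowerCase is behaving consistently: it lowercases each letter within its own script, so a Greek or Cyrillic look-alike never turns into a Latin letter.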

So I'm a bit puzzled about how this can happen, and whether there is any way to "convert" it to the supposed real letter.

There is a Python library called unidecode that I've used to solve this problem in Python before; it basically "flattens" Unicode into ASCII.

A quick Google search reveals that a similar library is available for JavaScript.

You can create a separate canvas for each Latin letter, upper case and lower case, to compare against. Each time you encounter a character outside the ASCII range, create a new canvas for it and compare it against each Latin letter's canvas using an image diff algorithm. Replace the non-Latin character with the closest match.

For example:

var latinize = (function () {
    var latinLetters = [],
        canvases = [],
        size = 16,
        halfSize = size >> 1;

    function makeCanvas(chr) {
        var canvas = document.createElement('canvas'),
            context = canvas.getContext('2d');

        canvas.width = size;
        canvas.height = size;
        context.textBaseline = 'middle';
        context.textAlign = 'center';
        context.font = (halfSize) + "px sans-serif";
        context.fillText(chr, halfSize, halfSize);

        return context;
    }

    function nextChar(chr) {
        return String.fromCharCode(chr.charCodeAt(0) + 1);
    }

    function setupRange(from, to) {
        for (var chr = from; chr <= to; chr = nextChar(chr)) {
            latinLetters.push(chr);
            canvases.push(makeCanvas(chr));
        }
    }

    function calcDistance(ctxA, ctxB) {
        var distance = 0,
            dataA = ctxA.getImageData(0, 0, size, size).data,
            dataB = ctxB.getImageData(0, 0, size, size).data;

        for (var i = dataA.length; i--;) {
            distance += Math.abs(dataA[i] - dataB[i]);
        }

        return distance;
    }

    setupRange('a', 'z');
    setupRange('A', 'Z');
    setupRange('', ''); // one blank canvas, so invisible characters (e.g. zero-width spaces) map to ''

    return function (text) {
        var result = "",
            scores, canvas;

        for (var i = 0; i < text.length; i++) {
            if (text.charCodeAt(i) < 128) {
                result += text.charAt(i);
                continue;
            }
            scores = [];
            canvas = makeCanvas(text.charAt(i));
            for (var j = 0; j < canvases.length; j++) {
                scores.push({
                    glyph: latinLetters[j],
                    score: calcDistance(canvas, canvases[j])
                });
            }
            scores.sort(function (a, b) {
                return a.score - b.score;
            });
            result += scores[0].glyph;
        }

        return result;
    }
}());

This translates your test string to "the nike dunk high sb 'ugly sweater' is now available".

The alternative is to create a giant data structure mapping all of the look-alike characters to their Latin-1 equivalents, as the library in @willy's answer does. That is extremely heavy for browser JavaScript and probably not suitable for sending to the client, as you can see by looking at the source of that project.
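
To make the table approach concrete, here is a minimal sketch of such a mapping. LOOKALIKES and latinizeWithTable are hypothetical names, and the handful of entries was picked by hand for illustration. A usable table would need far more entries, which is exactly the size problem described above.

```javascript
// Hypothetical, deliberately tiny table of look-alike characters.
// Each entry maps one non-Latin code point to the Latin letter it resembles.
var LOOKALIKES = {
    '\u039D': 'N', // GREEK CAPITAL LETTER NU
    '\u0392': 'B', // GREEK CAPITAL LETTER BETA
    '\u0405': 'S', // CYRILLIC CAPITAL LETTER DZE
    '\u041D': 'H', // CYRILLIC CAPITAL LETTER EN
    '\u0430': 'a', // CYRILLIC SMALL LETTER A
    '\u0435': 'e', // CYRILLIC SMALL LETTER IE
    '\u043E': 'o', // CYRILLIC SMALL LETTER O
    '\u0443': 'y', // CYRILLIC SMALL LETTER U
    '\u0455': 's', // CYRILLIC SMALL LETTER DZE
    '\u0456': 'i', // CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    '\u200B': ''   // ZERO WIDTH SPACE, dropped entirely
};

// Replace every character found in the table; leave everything else untouched.
function latinizeWithTable(text) {
    var result = '';

    for (var i = 0; i < text.length; i++) {
        var ch = text.charAt(i);
        result += Object.prototype.hasOwnProperty.call(LOOKALIKES, ch) ? LOOKALIKES[ch] : ch;
    }

    return result;
}

// "\u039D\u0456k\u200B\u0435" renders like "Nike" but contains only one Latin letter.
console.log(latinizeWithTable("\u039D\u0456k\u200B\u0435").toLowerCase()); // nike
```

Unlike the canvas comparison, this needs no DOM, so it also runs on the server, but any character missing from the table passes through unchanged.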

http://jsfiddle.net/Ly5Lt/4/
