
What's up with these Unicode combining characters and how can we filter them?

กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิก้้้้้้้้้้้้้้้้้้้้ก็็็็็็็็็็็็็็็็็็็็ก็็็็็็็็็็็็็็็็็็็็กิิิิิิิิิิิิิิิิิิิิก้้้้้้้้้้้้้้้้้้้้ก็็็็็็็็็็็็็็็็็็็็กิิิิิิิิิิิิิิิิิิิิก้้้้้้้้้้้้้้้้้้้้กิิิิิิิิิิิิิิิิิิิิก้้้้้้้้้้้้้้้้้้้้ก็็็็็็็็็็็็็็็็็็็็ก็็็็็็็็็็็็็็็็็็็็กิิิิิิิิิิิิิิิิิิิิก้้้้้้้้้้้้้้้้้้้้ก็็็็็็็็็็็็ ็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ็็็็็็็กิิิิิิิิิิิิิิิิิิิิก้้้้้้้้้้้้้้้้้้้้

These recently showed up in Facebook comment sections.

How can we sanitize this?

What's up with these Unicode characters?

That's a character with a series of combining characters. Because the combining characters in question want to go above the base character, they stack up (literally). For instance, in the case of

ก้้้้้้้้้้้้้้้้้้้้

...it's a ก (Thai character ko kai, U+0E01) followed by 20 copies of the Thai combining character mai tho (U+0E49).
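As an illustration (this snippet is not from the original answer), the same stacked glyph can be built directly from those two code points; a minimal JavaScript sketch:

    // Build "ko kai" followed by 20 copies of the combining mark "mai tho".
    const koKai = "\u0E01";      // ก (base character)
    const maiTho = "\u0E49";     // combining mark that stacks above the base
    const stacked = koKai + maiTho.repeat(20);

    console.log(stacked);        // renders as one tall, stacked glyph
    console.log(stacked.length); // 21 UTF-16 code units, but one user-perceived character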

How can we sanitize this?

You could pre-process the text and limit the number of combining characters that can be applied to a single character, but the effort may not be worth the reward. You'd need the data sheets for all the current characters so you'd know whether they were combining or not, and you'd need to be sure to allow at least a few, because some languages are written with several diacritics on a single base. Now, if you want to limit comments to the Latin character set, that would be an easier range check, but of course that's only an option if you want to limit comments to just a few languages. More information, code sheets, etc. at unicode.org.
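As a rough sketch of that pre-processing idea (not from the answer itself), modern JavaScript regex engines support Unicode property escapes, so you can cap the number of combining marks per base character without maintaining your own data sheets; the cap of 2 below is an arbitrary choice:

    // Keep at most MAX_MARKS combining marks (category M) after each base character.
    // Requires a runtime with Unicode property escapes (the /u flag).
    const MAX_MARKS = 2; // arbitrary limit; some scripts legitimately need a few marks

    function limitCombiningMarks(text, max = MAX_MARKS) {
      const pattern = new RegExp(`(\\p{Mark}{${max}})\\p{Mark}+`, "gu");
      return text.replace(pattern, "$1");
    }

    console.log(limitCombiningMarks("ก้้้้้้้้้้้้้้้้้้้้")); // ก with only two marks left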

BTW, if you ever want to know how some character was composed: for another question just recently, I coded up a quick-and-dirty "Unicode Show Me" page on JSBin. You just copy and paste the text into the text area, and it shows you all of the code points (~characters) that the text is made up of, with links such as those above to the page describing each character. It only works for code points in the range U+FFFF and under, because it's written in JavaScript, and to handle characters above U+FFFF in JavaScript you have to do more work than I wanted to do for that question (in JavaScript, a "character" is always 16 bits, which means that for some languages a character can be split across two separate JavaScript "characters", and I didn't account for that), but it's handy for most texts...
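For reference (this is not the JSBin page's actual code), newer JavaScript can iterate a string by code point rather than by 16-bit unit, which sidesteps the surrogate-pair issue mentioned above; a minimal sketch:

    // List the code points a string is made of, including ones above U+FFFF.
    // Iterating a string with for...of yields whole code points, not 16-bit halves.
    function listCodePoints(text) {
      const result = [];
      for (const ch of text) {
        const cp = ch.codePointAt(0);
        result.push("U+" + cp.toString(16).toUpperCase().padStart(4, "0"));
      }
      return result;
    }

    console.log(listCodePoints("ก้"));  // ["U+0E01", "U+0E49"]
    console.log(listCodePoints("𝄞"));   // ["U+1D11E"] — one code point, two UTF-16 units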

If you have a regex engine with decent Unicode support, it's trivial to sanitize this kind of string. In Perl, for example, you can remove all but the first combining mark from every (user-perceived) character like this:

#!/usr/bin/perl
use strict;
use utf8;

binmode(STDOUT, ':utf8');

my $string = "กิิ ก้้ ก็็ ก็็ กิิ ก้้ ก็็ กิิ ก้้ กิิ ก้้ ก็็ ก็็ กิิ ก้้ ก็็ กิิ ก้้";
$string =~ s/(\p{Mark})\p{Mark}+/$1/g; # Strip excess combining marks
print("$string\n");

This will print:

กิ ก้ ก็ ก็ กิ ก้ ก็ กิ ก้ กิ ก้ ก็ ก็ กิ ก้ ก็ กิ ก้

"How can we sanitize this" is best answered above by TJ Crowder “我们如何消毒这个”最好由TJ Crowder在上面回答

However, I think sanitization is the wrong approach; Cristy has it right with overflow: hidden on the containing CSS element.

At least, that's how I'm solving it.

OK, this one took me a while to figure out. I was under the impression that the combining characters used to produce zalgo are limited to these. So I expected the following regex to catch the freaks:

([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]{2,})

and it didn't work...

The catch is that the list in the wiki does not cover the full range of combining characters.

What gave me a hint is that "ก้้้้้้้้้้้้้้้้้้้้".charCodeAt(2).toString(16) returns "e49", which is not within any of those listed combining ranges; it sits in the Thai block instead.
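One way to check this without hand-maintained range lists (a sketch, not from the original answer) is to test the character's Unicode general category directly; in JavaScript with the /u flag:

    // Test whether a character is a nonspacing combining mark (category Mn),
    // regardless of which block it lives in.
    const isNonSpacingMark = (ch) => /\p{Mn}/u.test(ch);

    console.log(isNonSpacingMark("\u0E49")); // true  — Thai mai tho, outside the "Combining" blocks
    console.log(isNonSpacingMark("\u0301")); // true  — combining acute accent
    console.log(isNonSpacingMark("a"));      // false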

In C# they fall under UnicodeCategory.NonSpacingMark, and the following script flushes them out:

    [Test]
    public void IsZalgo()
    {
        // The categories we treat as stackable ("zalgo") marks.
        var zalgo = new[] { UnicodeCategory.NonSpacingMark };

        File.Delete("IsModifyLike.html");
        File.AppendAllText("IsModifyLike.html", "<table>");
        for (var i = 0; i < 65535; i++)
        {
            var c = (char)i;
            if (zalgo.Contains(Char.GetUnicodeCategory(c)))
            {
                // Emit the code point, the character, its category, and a sample
                // of the mark stacked three times on a base "A".
                File.AppendAllText("IsModifyLike.html", string.Format("<tr><td>{0}</td><td>{1}</td><td>{2}</td><td>A&#{3};&#{3};&#{3};</td></tr>\n", i.ToString("X"), c, Char.GetUnicodeCategory(c), i));
            }
        }
        File.AppendAllText("IsModifyLike.html", "</table>");
    }

By looking at the generated table you should be able to see which ones do stack. One range that is missing from the wiki is 06D6-06DC, another is 0730-0749.

UPDATE:

Here's an updated regex that should fish out all the zalgo, including ones bypassed in the 'normal' range.

([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62]{2,})
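As a usage sketch (assuming a JavaScript environment; the character class itself is the one from the answer above), the pattern can be used either to detect suspicious runs of marks or to strip them:

    // The answer's character class, wrapped in a JS regex.
    // {2,} means "two or more characters from the class in a row", i.e. a stacked run.
    const zalgoRun = /[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0E00-\u0EFF\uFC5E-\uFC62]{2,}/;

    const input = "ก้้้้้้้้้้้้้้้้้้้้ normal text";

    // Detection: does the text contain a stacked run?
    console.log(zalgoRun.test(input)); // true

    // Removal: strip every stacked run (note the added g flag for a global replace).
    console.log(input.replace(new RegExp(zalgoRun.source, "g"), ""));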

The hardest bit is identifying them; once you have done that, there's a multitude of solutions, including some good ones above.

Hope this saves you some time.
