Range of UTF-8 Characters in C++11 Regex

This question is an extension of "Do C++11 regular expressions work with UTF-8 strings?"

#include <iostream>
#include <regex>
int main() {
  if (std::regex_match("中", std::regex("中")))  // "\u4e2d" also works
    std::cout << "matched\n";
}

The program is compiled on Mac Mountain Lion with clang++, using the following options:

clang++ -std=c++0x -stdlib=libc++

The code above works. "[一-龠々〆ヵヶ]" is a standard range regex for matching any Japanese kanji or Chinese character. It works in JavaScript and Ruby, but I can't seem to get ranges working in C++11, even with a similar version, [\u4E00-\u9FA0]. The code below does not match the string.

if (std::regex_match ("中", std::regex("[一-龠々〆ヵヶ]")))
  std::cout << "range matched\n";

Changing the locale hasn't helped either. Any ideas?

EDIT

So I have found that all ranges work if you add a + to the end, in this case [一-龠々〆ヵヶ]+, but if you add {1} instead, as in [一-龠々〆ヵヶ]{1}, it does not work. Moreover, it seems to overreach its boundaries. It won't match Latin characters, but it will match は, which is \u306F, and ぁ, which is \u3041. They both lie below \u4E00.

nhahtdh also suggested regex_search, which works without adding +, but it still runs into the same problem as above by pulling in values outside of its range. I played with the locales a bit as well. Mark Ransom suggests it treats the UTF-8 string as a dumb set of bytes, and I think this is possibly what it is doing.

Further pushing the theory that UTF-8 is getting jumbled somehow: [a-z]{1} and [a-z]+ both match a, but only [一-龠々〆ヵヶ]+ matches any of the characters, not [一-龠々〆ヵヶ]{1}.

Encoded in UTF-8, the string "[一-龠々〆ヵヶ]" is equal to this one: "[\xe4\xb8\x80-\xe9\xbe\xa0\xe3\x80\x85\xe3\x80\x86\xe3\x83\xb5\xe3\x83\xb6]". And this is not the character class you are looking for.

The character class you are looking for is the one that includes:

  • any character in the range U+4E00..U+9FA0; or
  • any of the characters 々, 〆, ヵ, ヶ.

The character class you specified is the one that includes:

  • any of the "characters" \xe4 or \xb8; or
  • any "character" in the range \x80..\xe9; or
  • any of the "characters" \xbe, \xa0, \xe3, \x80, \x85, \xe3 (again), \x80 (again), \x86, \xe3 (again), \x83, \xb5, \xe3 (again), \x83 (again), \xb6.

Messy, isn't it? Do you see the problem?

This will not match "latin" characters (which I assume means things like a-z) because in UTF-8 those all use a single byte below 0x80, and none of those is in that messy character class.

It will not match "中" either, because "中" has three "characters" and your regex matches only one "character" out of that weird long list. Try assert(std::regex_match("中", std::regex("..."))) and you will see.

If you add a +, it works because "中" has three of those "characters" from your weird long list, and now your regex matches one or more.

If you instead add {1}, it does not match because we are back to matching three "characters" against one.

Incidentally, "中" matches "中" because we are matching the three "characters" against the same three "characters" in the same order.

Note that the regex with + will actually match some undesired things because it does not care about order. Any character that can be made from that list of bytes in UTF-8 will match. It will match "\xe3\x81\x81" (ぁ U+3041) and it will even match invalid UTF-8 input like "\xe3\xe3\xe3\xe3".

The bigger problem is that you are using a regex library that does not even have level 1 support for Unicode, the bare minimum required. It munges bytes and there isn't much your precious tiny regex can do about it.

And the even bigger problem is that you are using a hardcoded set of characters to specify "any Japanese Kanji or Chinese character". Why not use the Unicode Script property for that?

R"(\p{Script=Han})"

Oh right, this won't work with C++11 regexes. For a moment there I almost forgot those are annoyingly worse than useless with Unicode.

So what should you do?

You could decode your input into a std::u32string and use char32_t all over for the matching. That would not give you this mess, but you would still be hardcoding ranges and exceptions when you mean "a set of characters that share a certain property".

I recommend you forget about C++11 regexes and use some regular expression library that has the bare minimum level 1 Unicode support, like the one in ICU.
