How to search for a non-ASCII character in a C++ string?

string s="x1→(y1⊕y2)∧z3";

for(auto i=s.begin(); i!=s.end();i++){
    if(*i=='→'){
       ...
    }
} 

The char comparison is definitely wrong. What's the correct way to do it? I am using VS2013.

First you need some basic understanding of how programs handle Unicode. If you don't have it yet, you should read up; I quite like this post on Joel on Software.

You actually have 2 problems here:

Problem #1: getting the string into your program

Your first problem is getting that actual string into your string s. Depending on the encoding of your source code file, MSVC may corrupt any non-ASCII characters in that string.

  • Either save your C++ file as UTF-16 (which Windows confusingly calls Unicode) and use wchar_t and wstring (effectively encoding the expression as UTF-16). Saving as UTF-8 with BOM will also work. With any other encoding, your L"..." literals will contain the wrong characters.

    Note that other platforms may define wchar_t as 4 bytes instead of 2, so the handling of characters above U+FFFF will be non-portable.

  • In all other cases, you can't just write those characters in your source file. The most portable way is to encode your string literals as UTF-8, using \x escape codes for all non-ASCII characters. Like this: "x1\xe2\x86\x92(a\xe2\x8a\x95" "b)" rather than "x1→(a⊕b)". The literal is split before b because a \x escape consumes every hex digit that follows it.

    And yes, that's as unreadable and cumbersome as it gets. The root problem is that MSVC doesn't really support using UTF-8. You can go through this question for an overview: How to create a UTF-8 string literal in Visual C++ 2008.

    But also consider how often those strings will actually show up in your source code.
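As a rough sketch of the escape-code approach, the two expressions from this thread could be written as below. The function names are made up for illustration, and the source file itself contains only ASCII, so its encoding no longer matters.

```cpp
#include <string>

// Hypothetical helper: the expression from the question as UTF-8 bytes.
// U+2192 (rightwards arrow) = E2 86 92
// U+2295 (circled plus)     = E2 8A 95
// U+2227 (logical and)      = E2 88 A7
std::string question_expression_utf8() {
    return "x1\xe2\x86\x92(y1\xe2\x8a\x95y2)\xe2\x88\xa7z3";
}

// Hypothetical helper: the shorter example from the answer.
// The literal is split before "b" because a \x escape consumes every
// hex digit that follows it, and 'b' is a hex digit.
std::string short_expression_utf8() {
    return "x1\xe2\x86\x92(a\xe2\x8a\x95" "b)";
}
```

Each of the three symbols takes 3 bytes in UTF-8, so size() on these strings reports bytes, not visible characters.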

Problem #2: finding the character

(If you're using UTF-16, you can just find the L'→' character, since that character is representable as one wchar_t. For characters above U+FFFF you'll have to use the wide version of the workaround below.)
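A minimal sketch of the wide-string search follows. The function name is hypothetical, and the arrow is written as the universal character name \u2192 so that the example does not depend on how the source file is saved.

```cpp
#include <string>

// Hypothetical helper: index (in wchar_t units) of the first U+2192 arrow,
// or std::wstring::npos if the expression does not contain one.
std::wstring::size_type find_arrow_wide(const std::wstring& expr) {
    return expr.find(L'\u2192');
}
```

For example, find_arrow_wide(L"x1\u2192(y1\u2295y2)\u2227z3") returns 2, because the arrow fits in a single wchar_t.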

It's impossible to define a char representing the arrow character. You can, however, do it with a string: "\xe2\x86\x92" (that's a string with 3 chars for the arrow, plus the \0 terminator).

You can now search for this string in your expression:

s.find("\xe2\x86\x92");

The UTF-8 encoding scheme guarantees this always finds the correct character, but keep in mind that the result is an offset in bytes, not in characters.
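If a character index is needed rather than a byte offset, one possible sketch is to count the bytes that are not UTF-8 continuation bytes. Both function names here are made up for illustration, and the code assumes the string holds valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: byte offset of the first UTF-8 encoded U+2192 arrow,
// or std::string::npos if there is none.
std::size_t find_arrow_utf8(const std::string& expr) {
    return expr.find("\xe2\x86\x92");
}

// Hypothetical helper: number of code points in the first byte_offset bytes.
// Continuation bytes have the form 10xxxxxx, so every other byte starts a
// new code point. Assumes expr is valid UTF-8.
std::size_t code_points_before(const std::string& expr, std::size_t byte_offset) {
    assert(byte_offset <= expr.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_offset; ++i) {
        if ((static_cast<unsigned char>(expr[i]) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Passing the result of find (when it is not npos) to code_points_before converts the byte offset into the number of characters that precede the match.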

My comment is too large, so I am submitting it as an answer.

The problem is that everybody is concentrating on the different encodings that Unicode may use (UTF-8, UTF-16, UCS2, etc.). But that is where your problems only begin.

There is also the issue of composite characters, which will really mess up any search that you are trying to make.

Let's say you are looking for the character 'é'. You find it in Unicode as U+00E9 and do your search, but this is not guaranteed to be the only way to represent the character. The document may also contain the combination U+0065 U+0301, which is exactly the same character.

Yes, not just "a character that looks the same": it is exactly the same, so software and even some programming libraries will freely convert from one form to the other without telling you.

So if you want a robust search, you will need something that represents not just the different encodings of Unicode, but Unicode characters themselves, with equality between composite and ready-made (precomposed) characters.
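To illustrate the point, here is a deliberately narrow sketch that checks for 'é' in both spellings, assuming UTF-8 input. The function name is hypothetical and the check covers only this one character; a general solution would normalize both the text and the search term with a Unicode library (ICU is one option) before comparing.

```cpp
#include <string>

// Hypothetical helper: true if utf8_text contains e-acute in either of its
// two canonically equivalent spellings. Handles only this one character.
bool contains_e_acute(const std::string& utf8_text) {
    const std::string precomposed = "\xc3\xa9";   // U+00E9
    const std::string decomposed  = "e\xcc\x81";  // U+0065 followed by U+0301
    return utf8_text.find(precomposed) != std::string::npos ||
           utf8_text.find(decomposed) != std::string::npos;
}
```

A plain find for "\xc3\xa9" alone would miss a document that stores the letter as 'e' followed by the combining acute accent.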
