How to correctly skip Unicode (UTF-8) characters?

Question

I have written a parser that turns out to work incorrectly with UTF-8 text.

The parser is very simple:

while (pos < end) {
    // find some ASCII char
    if (text.at(pos) == '@') {
        // Check some conditions and if the syntax is wrong...
        if (...)
            createDiagnostic(pos);
    }

    pos++;
}

So you can see I am creating a diagnostic at pos. But that pos is wrong if there were some non-ASCII characters before it, because a UTF-8 character may in reality consist of more than one char. How do I correctly skip the UTF-8 chars as if each were one character?

I need this because the diagnostics are sent to UTF-8-aware VSCode.

I tried to read some articles on UTF-8 in C++, but all the material I found is huge, and I only need to skip the UTF-8 characters.

Answer 1

If the code point is less than 128, then UTF-8 encodes it as a single ASCII byte (highest bit not set). If the code point is equal to or larger than 128, all of its encoded bytes will have the highest bit set. So, this will work:

unsigned char b = <...>; // b is a byte from a utf-8 string
if (b&0x80) {
    // ignore it, as b is part of a >=128 codepoint
} else {
    // use b as an ASCII code
}
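
Applied to the parser loop from the question, the same idea could look roughly like the sketch below. It keeps scanning bytes, but also keeps a separate counter of characters (code points), and reports that counter instead of the byte offset. The names isContinuationByte and findAtSigns are hypothetical, introduced only for illustration, and the sketch assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns true for UTF-8 continuation bytes (bit pattern 10xxxxxx).
// Hypothetical helper name, not part of the original question.
inline bool isContinuationByte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical stand-in for the parser loop: returns the code-point index
// (not the byte offset) of every '@' in a UTF-8 string.
// Assumes text is valid UTF-8.
std::vector<std::size_t> findAtSigns(const std::string& text) {
    std::vector<std::size_t> positions;
    std::size_t charIndex = 0;  // code points that start before the current byte
    for (std::size_t pos = 0; pos < text.size(); ++pos) {
        const unsigned char b = static_cast<unsigned char>(text[pos]);
        if (isContinuationByte(b)) {
            continue;  // still inside a multi-byte character, nothing to count
        }
        if (b == '@') {
            positions.push_back(charIndex);  // report the character position
        }
        ++charIndex;  // an ASCII byte or a lead byte starts a new character
    }
    return positions;
}
```

Because ASCII bytes never occur inside a multi-byte sequence, the comparison with '@' cannot match in the middle of a character. Note that this counts code points; if the consumer counts UTF-16 code units instead (as the Language Server Protocol does by default), a 4-byte sequence would have to count as two units.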

Note: if you want to calculate the number of UTF-8 code points in a string, then you have to count the bytes that satisfy either of these conditions:

!(b&0x80): this means that the byte is an ASCII character, or
(b&0xc0)==0xc0: this means that the byte is the first byte of a multi-byte UTF-8 sequence
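
As a minimal sketch of that counting rule, the function below converts a byte offset into a character count by testing each byte against the two conditions above. The name countCodePoints and its byteCount parameter are assumptions made for this example, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the code points that start within the first
// byteCount bytes of a UTF-8 string. Assumes text is valid UTF-8.
std::size_t countCodePoints(const std::string& text, std::size_t byteCount) {
    const std::size_t limit = byteCount < text.size() ? byteCount : text.size();
    std::size_t count = 0;
    for (std::size_t i = 0; i < limit; ++i) {
        const unsigned char b = static_cast<unsigned char>(text[i]);
        const bool isAscii = !(b & 0x80);            // 0xxxxxxx
        const bool isLeadByte = (b & 0xC0) == 0xC0;  // 11xxxxxx
        if (isAscii || isLeadByte) {
            ++count;  // continuation bytes (10xxxxxx) are not counted
        }
    }
    return count;
}
```

With a helper like this, a byte offset such as pos could be translated just before the diagnostic is created, for example by passing countCodePoints(text, pos) instead of pos. Recounting from the start for every diagnostic is linear in the offset, so a running counter inside the parser loop may be preferable for large inputs.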

Answer 1 (1, accepted 2019-09-22 09:26:00)
