How do I read UTF-8 characters via a pointer?
Suppose I have UTF-8 content stored in memory. How do I read the characters using a pointer? I presume I need to watch for the 8th bit indicating a multi-byte character, but how exactly do I turn the sequence into a valid Unicode character? Also, is wchar_t the proper type to store a single Unicode character?
This is what I have in mind:
wchar_t readNextChar (char*& p)
{
    wchar_t unicodeChar;
    char ch = *p++;
    if ((ch & 128) != 0)
    {
        // This is a multi-byte character, what do I do now?
        // char chNext = *p++;
        // ... but how do I assemble the Unicode character?
        ...
    }
    ...
    return unicodeChar;
}
You have to decode the UTF-8 bit pattern to its unencoded UTF-32 representation. If you want the actual Unicode codepoint, you have to use a 32-bit data type.
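As a worked illustration of that bit pattern, here is a minimal sketch that assembles the three-byte sequence E2 82 AC (the euro sign) into U+20AC. The name assemble3 is hypothetical, and the sketch assumes the bytes were already validated, so it is not a full decoder.

```cpp
#include <cstdint>

// Hypothetical helper: combines one already-validated 3-byte UTF-8 sequence
// (1110xxxx 10yyyyyy 10zzzzzz) into the code point xxxxyyyyyyzzzzzz.
inline std::uint32_t assemble3(unsigned char b0, unsigned char b1, unsigned char b2)
{
    std::uint32_t cp = b0 & 0x0Fu;   // keep the 4 payload bits of the lead byte
    cp = (cp << 6) | (b1 & 0x3Fu);   // append the 6 payload bits of the 2nd byte
    cp = (cp << 6) | (b2 & 0x3Fu);   // append the 6 payload bits of the 3rd byte
    return cp;
}
```

Calling assemble3(0xE2, 0x82, 0xAC) yields 0x20AC. Two-byte and four-byte sequences follow the same shift-and-or pattern, with 5 and 3 payload bits in the lead byte respectively.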
On Windows, wchar_t is NOT large enough, as it is only 16 bits wide. You have to use an unsigned int or unsigned long instead. Use wchar_t only when dealing with UTF-16 code units.
On other platforms, wchar_t is usually 32 bits wide. But when writing portable code, you should stay away from wchar_t except where absolutely needed (like std::wstring).
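If you want the compiler to tell you which case you are in, something like the following sketch may help. The alias code_point and the constant wchar_holds_any_code_point are names I made up for illustration, and the snippet assumes a C++11 compiler.

```cpp
#include <cstdint>
#include <limits>

// Assumed alias (not part of any standard API): a type wide enough for any
// Unicode code point on every platform.
using code_point = std::uint32_t;

// True when wchar_t can represent every code point up to U+10FFFF.
// Typically false on Windows (16-bit wchar_t) and true where wchar_t is 32-bit.
constexpr bool wchar_holds_any_code_point =
    static_cast<unsigned long long>(std::numeric_limits<wchar_t>::max()) >= 0x10FFFFull;

// Code points need 21 bits; fail the build if the alias is ever too narrow.
static_assert(std::numeric_limits<code_point>::digits >= 21,
              "code_point must hold at least 21 bits");
```

Portable code can then store decoded characters in code_point and only convert to wchar_t at the API boundary that demands it.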
Try something more like this:
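// u_char and u_long are not standard C++ types; define them here if your
// platform headers do not already provide them.
typedef unsigned char u_char;
typedef unsigned long u_long;
#define IS_IN_RANGE(c, f, l) (((c) >= (f)) && ((c) <= (l)))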
u_long readNextChar (char* &p)
{
    // TODO: since UTF-8 is a variable-length
    // encoding, you should pass in the input
    // buffer's actual byte length so that you
    // can determine if a malformed UTF-8
    // sequence would exceed the end of the buffer...
    u_char c1, c2, *ptr = (u_char*) p;
    u_long uc = 0;
    int seqlen;
    // int datalen = ... available length of p ...;
    /*
    if( datalen < 1 )
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }
    */
    c1 = ptr[0];
    if( (c1 & 0x80) == 0 )
    {
        uc = (u_long) (c1 & 0x7F);
        seqlen = 1;
    }
    else if( (c1 & 0xE0) == 0xC0 )
    {
        uc = (u_long) (c1 & 0x1F);
        seqlen = 2;
    }
    else if( (c1 & 0xF0) == 0xE0 )
    {
        uc = (u_long) (c1 & 0x0F);
        seqlen = 3;
    }
    else if( (c1 & 0xF8) == 0xF0 )
    {
        uc = (u_long) (c1 & 0x07);
        seqlen = 4;
    }
    else
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }
    /*
    if( seqlen > datalen )
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }
    */
    for(int i = 1; i < seqlen; ++i)
    {
        c1 = ptr[i];
        if( (c1 & 0xC0) != 0x80 )
        {
            // malformed data, do something !!!
            return (u_long) -1;
        }
    }
    switch( seqlen )
    {
        case 2:
        {
            c1 = ptr[0];
            if( !IS_IN_RANGE(c1, 0xC2, 0xDF) )
            {
                // malformed data, do something !!!
                return (u_long) -1;
            }
            break;
        }
        case 3:
        {
            c1 = ptr[0];
            c2 = ptr[1];
            switch (c1)
            {
                case 0xE0:
                    if (!IS_IN_RANGE(c2, 0xA0, 0xBF))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;
                case 0xED:
                    if (!IS_IN_RANGE(c2, 0x80, 0x9F))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;
                default:
                    if (!IS_IN_RANGE(c1, 0xE1, 0xEC) && !IS_IN_RANGE(c1, 0xEE, 0xEF))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;
            }
            break;
        }
        case 4:
        {
            c1 = ptr[0];
            c2 = ptr[1];
            switch (c1)
            {
                case 0xF0:
                    if (!IS_IN_RANGE(c2, 0x90, 0xBF))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;
                case 0xF4:
                    if (!IS_IN_RANGE(c2, 0x80, 0x8F))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;
                default:
                    if (!IS_IN_RANGE(c1, 0xF1, 0xF3))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;
            }
            break;
        }
    }
    for(int i = 1; i < seqlen; ++i)
    {
        uc = ((uc << 6) | (u_long)(ptr[i] & 0x3F));
    }
    p += seqlen;
    return uc;
}
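Here is a quick macro that returns the byte length of a UTF-8 character, given its first byte:
#define UTF8_CHAR_LEN( byte ) ((( 0xE5000000 >> ((( byte ) >> 3 ) & 0x1e )) & 3 ) + 1 )
This lets you detect the size of each UTF-8 character up front for easier parsing. Note that it only inspects the lead byte: it does not validate the sequence, and a continuation byte (0x80 to 0xBF) yields 1.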
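To show how such a length lookup might be used, here is a rough sketch that counts the characters in a buffer by hopping from lead byte to lead byte. The names utf8_char_len and count_code_points are hypothetical, and the loop assumes the buffer already holds well-formed UTF-8.

```cpp
#include <cstddef>

// Same table trick as the macro, written as a function. Only the lead byte
// is inspected: continuation bytes (0x80..0xBF) report 1, and the invalid
// lead bytes 0xF8..0xFF report 4.
inline std::size_t utf8_char_len(unsigned char lead)
{
    return ((0xE5000000u >> ((lead >> 3) & 0x1Eu)) & 3u) + 1u;
}

// Counts characters by stepping over each sequence in one jump.
// Assumes [p, p + n) holds well-formed UTF-8; nothing is validated.
inline std::size_t count_code_points(const char* p, std::size_t n)
{
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < n) {
        i += utf8_char_len(static_cast<unsigned char>(p[i]));
        ++count;
    }
    return count;
}
```

For the six bytes 61 C3 A9 E2 82 AC ("a", "é", "€") this reports 3 characters. On untrusted input you would still want a validating decoder, since a truncated final sequence makes the index step past n.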
If you need to decode UTF-8, you need to develop a UTF-8 parser. UTF-8 is a variable-length encoding (1 to 4 bytes), so you really have to write a parser that is compliant with the standard: see Wikipedia, for example.
If you do not want to write your own parser, I suggest using a library. You will find one in glib, for example (I personally have used Glib::ustring, the C++ wrapper of glib), but also in any good general-purpose library.
Edit:
I think that C++0x will include UTF-8 support too, but I'm no specialist...
my2c
Also, is wchar_t the proper type to store a single Unicode character?
On Linux, yes. On Windows, wchar_t represents a UTF-16 code unit, which isn't necessarily a character.
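To make the "code unit is not a character" point concrete, here is a small sketch of how a code point above U+FFFF is split into two 16-bit UTF-16 code units. SurrogatePair and to_surrogates are names invented for this example, and no range checking is done.

```cpp
#include <cstdint>

// Hypothetical helper type: the two UTF-16 code units of one character.
struct SurrogatePair {
    std::uint16_t high;
    std::uint16_t low;
};

// Splits a code point into its UTF-16 surrogate pair.
// Assumes 0x10000 <= cp <= 0x10FFFF; the caller must check the range.
inline SurrogatePair to_surrogates(std::uint32_t cp)
{
    const std::uint32_t v = cp - 0x10000u;  // 20 significant bits remain
    SurrogatePair sp;
    sp.high = static_cast<std::uint16_t>(0xD800u + (v >> 10));     // top 10 bits
    sp.low  = static_cast<std::uint16_t>(0xDC00u + (v & 0x3FFu));  // low 10 bits
    return sp;
}
```

For U+1F600 this gives D83D DE00, so a single 16-bit wchar_t on Windows can hold only half of that character.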
The upcoming C++0x standard (since published as C++11) provides the char16_t and char32_t types, designed to represent UTF-16 and UTF-32 code units.
If you are on a system where char32_t is unavailable and wchar_t is inadequate, use uint32_t to store Unicode characters.
This is my solution, in pure ANSI C, including a unit test for the corner cases.
Beware that int must be at least 32 bits wide. Otherwise you have to change the definition of codepoint.
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
typedef unsigned char byte;
typedef unsigned int codepoint;
/**
 * Reads the next UTF-8-encoded character from the byte array ranging
 * from {@code *pstart} up to, but not including, {@code end}. If the
 * conversion succeeds, the {@code *pstart} iterator is advanced,
 * the codepoint is stored into {@code *pcp}, and the function returns
 * 0. Otherwise the conversion fails, {@code errno} is set to
 * {@code EILSEQ} and the function returns -1.
 */
int
from_utf8(const byte **pstart, const byte *end, codepoint *pcp) {
    size_t len, i;
    codepoint cp, min;
    const byte *buf;

    buf = *pstart;
    if (buf == end)
        goto error;

    if (buf[0] < 0x80) {
        len = 1;
        min = 0;
        cp = buf[0];
    } else if (buf[0] < 0xC0) {
        goto error;
    } else if (buf[0] < 0xE0) {
        len = 2;
        min = 1 << 7;
        cp = buf[0] & 0x1F;
    } else if (buf[0] < 0xF0) {
        len = 3;
        min = 1 << (5 + 6);
        cp = buf[0] & 0x0F;
    } else if (buf[0] < 0xF8) {
        len = 4;
        min = 1 << (4 + 6 + 6);
        cp = buf[0] & 0x07;
    } else {
        goto error;
    }

    /* Compare lengths rather than forming buf + len, which could point
     * more than one element past the end of the array. */
    if ((size_t)(end - buf) < len)
        goto error;

    for (i = 1; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            goto error;
        cp = (cp << 6) | (buf[i] & 0x3F);
    }

    /* Reject overlong encodings, surrogates, and values past U+10FFFF. */
    if (cp < min)
        goto error;
    if (0xD800 <= cp && cp <= 0xDFFF)
        goto error;
    if (0x110000 <= cp)
        goto error;

    *pstart += len;
    *pcp = cp;
    return 0;

error:
    errno = EILSEQ;
    return -1;
}
static void
assert_valid(const byte **buf, const byte *end, codepoint expected) {
    codepoint cp;

    if (from_utf8(buf, end, &cp) == -1) {
        fprintf(stderr, "invalid unicode sequence for codepoint %u\n", expected);
        exit(EXIT_FAILURE);
    }
    if (cp != expected) {
        fprintf(stderr, "expected %u, got %u\n", expected, cp);
        exit(EXIT_FAILURE);
    }
}

static void
assert_invalid(const char *name, const byte **buf, const byte *end) {
    const byte *p;
    codepoint cp;

    p = *buf + 1;
    if (from_utf8(&p, end, &cp) == 0) {
        fprintf(stderr, "unicode sequence \"%s\" unexpectedly converts to %#x.\n", name, cp);
        exit(EXIT_FAILURE);
    }
    *buf += (*buf)[0] + 1;
}

static const byte valid[] = {
    0x00, /* first ASCII */
    0x7F, /* last ASCII */
    0xC2, 0x80, /* first two-byte */
    0xDF, 0xBF, /* last two-byte */
    0xE0, 0xA0, 0x80, /* first three-byte */
    0xED, 0x9F, 0xBF, /* last before surrogates */
    0xEE, 0x80, 0x80, /* first after surrogates */
    0xEF, 0xBF, 0xBF, /* last three-byte */
    0xF0, 0x90, 0x80, 0x80, /* first four-byte */
    0xF4, 0x8F, 0xBF, 0xBF /* last codepoint */
};

/* Each entry is a length byte followed by that many bytes of invalid input. */
static const byte invalid[] = {
    1, 0x80,
    1, 0xC0,
    1, 0xC1,
    2, 0xC0, 0x80,
    2, 0xC2, 0x00,
    2, 0xC2, 0x7F,
    2, 0xC2, 0xC0,
    3, 0xE0, 0x80, 0x80,
    3, 0xE0, 0x9F, 0xBF,
    3, 0xED, 0xA0, 0x80,
    3, 0xED, 0xBF, 0xBF,
    4, 0xF0, 0x80, 0x80, 0x80,
    4, 0xF0, 0x8F, 0xBF, 0xBF,
    4, 0xF4, 0x90, 0x80, 0x80
};

int
main() {
    const byte *p, *end;

    p = valid;
    end = valid + sizeof valid;
    assert_valid(&p, end, 0x000000);
    assert_valid(&p, end, 0x00007F);
    assert_valid(&p, end, 0x000080);
    assert_valid(&p, end, 0x0007FF);
    assert_valid(&p, end, 0x000800);
    assert_valid(&p, end, 0x00D7FF);
    assert_valid(&p, end, 0x00E000);
    assert_valid(&p, end, 0x00FFFF);
    assert_valid(&p, end, 0x010000);
    assert_valid(&p, end, 0x10FFFF);

    p = invalid;
    end = invalid + sizeof invalid;
    assert_invalid("80", &p, end);
    assert_invalid("C0", &p, end);
    assert_invalid("C1", &p, end);
    assert_invalid("C0 80", &p, end);
    assert_invalid("C2 00", &p, end);
    assert_invalid("C2 7F", &p, end);
    assert_invalid("C2 C0", &p, end);
    assert_invalid("E0 80 80", &p, end);
    assert_invalid("E0 9F BF", &p, end);
    assert_invalid("ED A0 80", &p, end);
    assert_invalid("ED BF BF", &p, end);
    assert_invalid("F0 80 80 80", &p, end);
    assert_invalid("F0 8F BF BF", &p, end);
    assert_invalid("F4 90 80 80", &p, end);

    return 0;
}