UTF-8 decoder fails on non-ASCII characters

Note: if you've followed my recent questions, you'll see that they're all about my Unicode library exercise in C. It's one of my first few serious projects in C and I'm having many problems, so I'm sorry if I'm asking too many questions about one thing.

Part of my library decodes UTF-8 encoded char pointers into raw unsigned code points. However, certain planes don't decode correctly. Let's take a look at the (relevant) code:

typedef struct string {
 unsigned long length;
 unsigned *data;
} string;

// really simple stuff

string *upush(string *s, unsigned c) {
 if (!s->length) s->data = (unsigned *) malloc((s->length = 1) * sizeof(unsigned));
 else   s->data = (unsigned *) realloc(s->data, ++s->length * sizeof(unsigned));
 s->data[s->length - 1] = c;
 return s;
}

// UTF-8 conversions

string ctou(char *old) {
 unsigned long i, byte = 0, cur = 0;
 string new;
 new.length = 0;
 for (i = 0; old[i]; i++)
  if (old[i] < 0x80) upush(&new, old[i]);
  else if (old[i] < 0xc0)
   if (!byte) {
    byte = cur = 0;
    continue;
   } else {
    cur |= (unsigned)(old[i] & 0x3f) << (6 * (--byte));
    if (!byte) upush(&new, cur), cur = 0;
   }
  else if (old[i] < 0xc2) continue;
  else if (old[i] < 0xe0) {
   cur = (unsigned)(old[i] & 0x1f) << 6;
   byte = 1;
  }
  else if (old[i] < 0xf0) {
   cur = (unsigned)(old[i] & 0xf) << 12;
   byte = 2;
  }
  else if (old[i] < 0xf5) {
   cur = (unsigned)(old[i] & 0x7) << 18;
   byte = 3;
  }
  else continue;
 return new;
}

By the way, all upush does is push a code point onto the end of a string, reallocating memory as needed. ctou does the decoding work: it stores the number of bytes still needed in the current sequence in byte, and the in-progress code point in cur.

The code all seems correct to me. Let's try decoding U+10fffd, which is f4 8f bf bd in UTF-8. Doing this:

long i;
string b = ctou("\xf4\x8f\xbf\xbd");
for (i = 0; i < b.length; i++)
 printf("%x ", b.data[i]);

should print out:

10fffd

but instead it prints out:

fffffff4 ffffff8f ffffffbf ffffffbd

which is basically the four bytes of UTF-8, each with ffffff tacked on in front.

Any guidance as to what is wrong in my code?

The char type is allowed to be signed. When it is, a byte such as 0xF4 is stored as a negative value, and converting it to int and then to unsigned (which gives the same result as converting it directly to unsigned) produces a large number instead of 0xF4. This program shows the effect:

#include <stdio.h>

int main() {
  char c = '\xF4';
  int i = c;
  unsigned n = i;
  printf("%X\n", n);
  n = c;
  printf("%X\n", n);
  return 0;
}

Prints:

FFFFFFF4
FFFFFFF4

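Use unsigned char instead. With a signed char, every byte at or above 0x80 is negative, so the old[i] < 0x80 test in ctou always succeeds and each byte is pushed on its own, which is exactly the output you are seeing.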
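As a rough illustration of that advice, the sketch below reads each byte through an unsigned char cast before classifying it. The helper names byte_at and continuation_count are hypothetical (they are not part of the question's library), and the thresholds are simply copied from ctou:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the byte at s[i] as a value in 0..255,
   whether or not plain char is signed on this platform. */
static unsigned byte_at(const char *s, size_t i) {
    assert(s != NULL);
    return (unsigned char)s[i];
}

/* Hypothetical helper: how many continuation bytes a lead byte announces,
   using the same thresholds as ctou. Returns -1 for a byte that cannot
   start a sequence (a continuation byte, 0xc0, 0xc1, or 0xf5 and above). */
static int continuation_count(unsigned b) {
    if (b < 0x80) return 0;   /* ASCII */
    if (b < 0xc2) return -1;  /* continuation byte or overlong lead */
    if (b < 0xe0) return 1;   /* two-byte sequence */
    if (b < 0xf0) return 2;   /* three-byte sequence */
    if (b < 0xf5) return 3;   /* four-byte sequence */
    return -1;                /* beyond U+10FFFF */
}
```

Because byte_at never returns a negative value, continuation_count(byte_at("\xf4\x8f\xbf\xbd", 0)) is 3 rather than falling into the ASCII branch.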

You've probably overlooked the fact that char is a signed type on your platform. Always use:

  • unsigned char if you will be reading the actual values of bytes
  • signed char if you're using bytes as small signed integers
  • char for abstract strings where you don't care about the values, except perhaps for 0
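The three cases in the list above might look like this in practice. This is only a sketch; count_non_ascii and apply_delta are made-up names used for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the bytes >= 0x80 in a NUL-terminated string.
   The parameter stays a plain char pointer (an abstract string), but the
   byte values are inspected through an unsigned char pointer. */
static size_t count_non_ascii(const char *s) {
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;
    for (; *p; p++)           /* comparing against 0 is safe for any char type */
        if (*p >= 0x80) n++;  /* comparing against 0x80 needs unsigned char */
    return n;
}

/* Hypothetical helper: a byte used as a small signed integer, so the
   signedness is spelled out instead of being left to the platform. */
static int apply_delta(int base, signed char delta) {
    return base + delta;
}
```

For example, count_non_ascii("a\xc3\xa9") is 2 on any platform, whereas the same comparison on a plain char would depend on whether char happens to be signed.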

By the way, your code is extremely inefficient. Instead of calling realloc over and over, once per character, why not allocate sizeof(unsigned)*(strlen(old)+1) to begin with, then reduce the size at the end if it's too big? Of course, this is only one of the many inefficiencies.
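A minimal sketch of that single-allocation approach, combined with the unsigned char fix, could look like the following. The names ustring and decode_utf8 are hypothetical stand-ins for the question's string and ctou, and the sketch makes two assumptions of its own: an ASCII byte or a new lead byte abandons any unfinished sequence, and, like the original, nothing rejects overlong three- and four-byte forms or surrogates.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counterpart of the question's string struct. */
typedef struct ustring {
    size_t length;
    unsigned *data;
} ustring;

/* Sketch: decode UTF-8 with one up-front allocation. A string of n bytes
   never holds more than n code points, so strlen(old) + 1 slots suffice.
   On allocation failure the result has data == NULL and length == 0. */
static ustring decode_utf8(const char *old) {
    assert(old != NULL);
    const unsigned char *p = (const unsigned char *)old; /* bytes as 0..255 */
    ustring out = {0, NULL};
    unsigned cur = 0;   /* code point being assembled */
    int pending = 0;    /* continuation bytes still expected */

    out.data = malloc((strlen(old) + 1) * sizeof *out.data);
    if (!out.data) return out;

    for (; *p; p++) {
        unsigned b = *p;
        if (b < 0x80) {                 /* ASCII */
            out.data[out.length++] = b;
            pending = 0;                /* assumption: drop unfinished sequence */
        } else if (b < 0xc0) {          /* continuation byte */
            if (!pending) continue;     /* stray continuation: skip it */
            cur |= (b & 0x3f) << (6 * --pending);
            if (!pending) out.data[out.length++] = cur;
        } else if (b < 0xc2) {          /* 0xc0, 0xc1: overlong lead, skip */
            pending = 0;
        } else if (b < 0xe0) {
            cur = (b & 0x1f) << 6;  pending = 1;
        } else if (b < 0xf0) {
            cur = (b & 0x0f) << 12; pending = 2;
        } else if (b < 0xf5) {
            cur = (b & 0x07) << 18; pending = 3;
        } else {                        /* 0xf5 and above: invalid, skip */
            pending = 0;
        }
    }

    /* Shrink to the size actually used; keep the old block if that fails. */
    if (out.length) {
        unsigned *shrunk = realloc(out.data, out.length * sizeof *out.data);
        if (shrunk) out.data = shrunk;
    }
    return out;
}
```

With this version, decode_utf8("\xf4\x8f\xbf\xbd") yields a single code point, 0x10fffd, and the caller releases the buffer with free(result.data).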
