
UTF-8 decoder fails on non-ASCII characters

Note: if you've followed my recent questions, you'll see that they're all about my Unicode library exercise in C -- as one of my first few serious projects in C, I'm having many problems, so I'm sorry if I'm asking too many questions about one thing.

Part of my library decodes UTF-8 encoded char pointers into raw unsigned code points. However, certain planes don't decode correctly. Let's take a look at the (relevant) code:

typedef struct string {
 unsigned long length;
 unsigned *data;
} string;

// really simple stuff

string *upush(string *s, unsigned c) {
 if (!s->length) s->data = (unsigned *) malloc((s->length = 1) * sizeof(unsigned));
 else   s->data = (unsigned *) realloc(s->data, ++s->length * sizeof(unsigned));
 s->data[s->length - 1] = c;
 return s;
}

// UTF-8 conversions

string ctou(char *old) {
 unsigned long i, byte = 0, cur = 0;
 string new;
 new.length = 0;
 for (i = 0; old[i]; i++)
  if (old[i] < 0x80) upush(&new, old[i]);
  else if (old[i] < 0xc0)
   if (!byte) {
    byte = cur = 0;
    continue;
   } else {
    cur |= (unsigned)(old[i] & 0x3f) << (6 * (--byte));
    if (!byte) upush(&new, cur), cur = 0;
   }
  else if (old[i] < 0xc2) continue;
  else if (old[i] < 0xe0) {
   cur = (unsigned)(old[i] & 0x1f) << 6;
   byte = 1;
  }
  else if (old[i] < 0xf0) {
   cur = (unsigned)(old[i] & 0xf) << 12;
   byte = 2;
  }
  else if (old[i] < 0xf5) {
   cur = (unsigned)(old[i] & 0x7) << 18;
   byte = 3;
  }
  else continue;
 return new;
}

All upush does, by the way, is push a code point onto the end of a string, reallocating memory as needed. ctou does the decoding work: it keeps the number of continuation bytes still needed for the current sequence in byte, and the in-progress code point in cur.
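To make the bookkeeping concrete, here is a minimal sketch of the arithmetic that byte and cur are meant to perform for one four-byte sequence. The helper name assemble4 is hypothetical and not part of the library; it takes the bytes as unsigned char so that only the masking and shifting is on display, and it assumes unsigned is at least 32 bits wide.

```c
/* Hypothetical helper for illustration only: combines the payload bits of
   one four-byte UTF-8 sequence into a code point, using the same masks and
   shifts as ctou. Assumes unsigned has at least 32 bits. */
unsigned assemble4(unsigned char b0, unsigned char b1,
                   unsigned char b2, unsigned char b3) {
    unsigned cur = (unsigned)(b0 & 0x07) << 18; /* lead byte: 3 payload bits, byte = 3 */
    cur |= (unsigned)(b1 & 0x3f) << 12;         /* --byte gives 2, shift by 6 * 2 */
    cur |= (unsigned)(b2 & 0x3f) << 6;          /* --byte gives 1, shift by 6 * 1 */
    cur |= (unsigned)(b3 & 0x3f);               /* --byte gives 0, shift by 6 * 0 */
    return cur;
}
```

For example, the bytes f0 9f 98 80 contribute 0, 0x1f000, 0x600 and 0, so assemble4(0xf0, 0x9f, 0x98, 0x80) yields 0x1f600.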

The code all seems correct to me. Let's try decoding U+10fffd, which is f4 8f bf bd in UTF-8. Doing this:

long i;
string b = ctou("\xf4\x8f\xbf\xbd");
for (i = 0; i < b.length; i++)
 printf("%x ", b.data[i]);

should print out:

10fffd

but instead it prints out:

fffffff4 ffffff8f ffffffbf ffffffbd

which is basically the four bytes of UTF-8, each with ffffff tacked on in front.

Any guidance as to what is wrong in my code?

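The char type is allowed to be signed. When it is, a byte such as 0xF4 holds a negative value, so a comparison like old[i] < 0x80 is always true for it, and the value is sign-extended when it is converted to unsigned. Going through int first or converting directly to unsigned gives the same result, as this program shows: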

#include <stdio.h>

int main() {
  char c = '\xF4';
  int i = c;
  unsigned n = i;
  printf("%X\n", n);
  n = c;
  printf("%X\n", n);
  return 0;
}

Prints:

FFFFFFF4
FFFFFFF4

Use unsigned char instead.
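As a rough sketch of what that change looks like at the point where ctou classifies a byte, the range tests can be moved into a helper that takes an unsigned char, with a cast wherever a byte is read from a char *. Both function names below are hypothetical, and the ranges simply mirror the ones in the question.

```c
/* Hypothetical helper: how many continuation bytes should follow lead byte b.
   Returns -1 for bytes that cannot start a sequence (continuation bytes,
   the overlong leads 0xc0/0xc1, and anything from 0xf5 up). */
int continuation_count(unsigned char b) {
    if (b < 0x80) return 0;   /* ASCII */
    if (b < 0xc2) return -1;  /* 0x80..0xc1 */
    if (b < 0xe0) return 1;   /* two-byte sequence */
    if (b < 0xf0) return 2;   /* three-byte sequence */
    if (b < 0xf5) return 3;   /* four-byte sequence */
    return -1;
}

/* Reading from a char * needs the cast; without it, a signed char holding
   0xf4 would compare as a negative number and be treated as ASCII. */
int continuation_count_at(const char *s, unsigned long i) {
    return continuation_count((unsigned char)s[i]);
}
```

With the cast in place, the lead byte f4 is seen as 244 rather than -12, so it falls into the four-byte branch.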

You've probably ignored the fact that char is a signed type on your platform. Always use:

  • unsigned char if you will be reading the actual values of bytes
  • signed char if you're using bytes as small signed integers
  • char for abstract strings where you don't care about the values except perhaps for 0.

By the way, your code is extremely inefficient. Instead of calling realloc over and over per-character, why not allocate sizeof(unsigned)*(strlen(old)+1) to begin with, then reduce the size at the end if it's too big? Of course this is only one of the many inefficiencies.
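The single-allocation idea could be sketched roughly as follows. The function name ctou_prealloc is made up for this example, and the decoding branches just mirror the question's logic (read through an unsigned char pointer) rather than adding full validation. Every emitted code point consumes at least one input byte, so strlen(old) + 1 slots are always enough.

```c
#include <stdlib.h>
#include <string.h>

typedef struct string {
    unsigned long length;
    unsigned *data;
} string;

/* Sketch only: decode UTF-8 into code points with one up-front allocation
   and one shrinking realloc at the end. On allocation failure the returned
   string has data == NULL and length == 0. The caller frees data. */
string ctou_prealloc(const char *old) {
    const unsigned char *p = (const unsigned char *)old; /* bytes as 0..255 */
    string out = {0, NULL};
    unsigned cur = 0;
    int byte = 0; /* continuation bytes still expected */
    size_t i;

    out.data = malloc((strlen(old) + 1) * sizeof *out.data);
    if (!out.data) return out;

    for (i = 0; p[i]; i++) {
        unsigned char b = p[i];
        if (b < 0x80) {
            out.data[out.length++] = b;
        } else if (b < 0xc0) {
            if (byte) { /* stray continuation bytes are skipped */
                cur |= (unsigned)(b & 0x3f) << (6 * --byte);
                if (!byte) out.data[out.length++] = cur;
            }
        } else if (b < 0xc2) {
            continue; /* overlong lead bytes 0xc0 and 0xc1 */
        } else if (b < 0xe0) {
            cur = (unsigned)(b & 0x1f) << 6;  byte = 1;
        } else if (b < 0xf0) {
            cur = (unsigned)(b & 0x0f) << 12; byte = 2;
        } else if (b < 0xf5) {
            cur = (unsigned)(b & 0x07) << 18; byte = 3;
        }
        /* bytes 0xf5 and above are ignored, as in the question */
    }

    if (out.length) { /* give back the unused tail; keep the block if this fails */
        unsigned *shrunk = realloc(out.data, out.length * sizeof *out.data);
        if (shrunk) out.data = shrunk;
    }
    return out;
}
```

Calling ctou_prealloc("\xf4\x8f\xbf\xbd") should give a string of length 1 whose only element is 0x10fffd. The trade-off is a temporary over-allocation of up to four times the final size for input that is mostly four-byte sequences, in exchange for avoiding a realloc per code point.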
