
How to get the characters from a UTF-8 string

char *w = "Artîsté";
printf("%lu\n", strlen(w));
int z;
for(z=0; z<strlen(w); z++){
    //printf("%c", w[z]);  //prints as expected
    printf("%i: %c\n", z, w[z]);//doesn't print anything
}

If I run this, it crashes on the î. How do I print a multibyte char, and how do I know when I've hit a multibyte character?

If your execution environment uses UTF-8 (Linux, for example), your code will work as-is, as long as you set a suitable locale, as in setlocale(LC_ALL, "en_US.utf8"); before calling that printf.

demo: http://ideone.com/zFUYM

Otherwise, your best bet is probably to convert to wide string and print that. If you plan on doing something other than I/O with the individual characters of that string, you will have to do it anyway.

As for hitting a multibyte char, the portable way to test is if mblen() returns a value greater than 1.

Use the wide char and multi-byte functions:

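#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Count the characters (not bytes) in a multibyte string.
   Returns -1 if the string contains an invalid sequence. */
int utf8len(const char *str)
{
    const char *top = str + strlen(str);
    int len;
    for (len = 0; str < top; len++) {
        int n = mblen(str, top - str);
        if (n <= 0)             /* without this check, -1 would move str backwards */
            return -1;
        str += n;
    }
    return len;
}

int main(void)
{
    setlocale(LC_ALL, "en_US.utf8");
    const char *w = "Artîsté";
    printf("%zu\n", strlen(w));     /* byte count, not character count */

    int z, len = utf8len(w);
    if (len < 0)
        return 1;
    wchar_t wstr[len + 1];
    mbstowcs(wstr, w, len + 1);     /* len + 1 leaves room for the terminator */
    for (z = 0; z < len; z++)
        printf("%i: %lc\n", z, wstr[z]);
    return 0;
}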

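You got lucky with the first printf: because you never changed the data, the bytes of each multibyte character were written back to back and still formed valid UTF-8. Once you split the string into individual chars and printed other text between them, your output was no longer valid UTF-8.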
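To keep the bytes of one character together without relying on the locale, the UTF-8 lead byte can be inspected directly. This is a different technique from the mblen()/mbstowcs() approach in the answers, shown here only as a sketch: utf8_seq_len and print_utf8_chars are hypothetical names, the input is assumed to be UTF-8, and overlong encodings are not rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: length in bytes of the UTF-8 sequence that starts
   with `lead`, or 0 if `lead` is a continuation byte or not a valid lead. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* 10xxxxxx or invalid */
}

/* Hypothetical helper: prints each character of s on its own line, writing
   all of its bytes in one go. Returns the character count, or -1 if the
   string is malformed or truncated. */
int print_utf8_chars(const char *s)
{
    assert(s != NULL);
    int count = 0;
    while (*s != '\0') {
        size_t n = utf8_seq_len((unsigned char)*s);
        if (n == 0)
            return -1;
        /* every byte after the lead must look like 10xxxxxx;
           a terminating '\0' fails this check, so we never read past it */
        for (size_t i = 1; i < n; i++)
            if (((unsigned char)s[i] & 0xC0) != 0x80)
                return -1;
        printf("%d: %.*s\n", count, (int)n, s);
        s += n;
        count++;
    }
    return count;
}
```

The original loop with %c fails because it prints the index and a newline between the two bytes of î, so neither byte is valid on its own. Here the precision in %.*s writes all n bytes of a character in a single conversion, so the terminal receives each sequence intact.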
