
UTF-8 -> ASCII in C language

I have a simple question that I can't find an answer to anywhere on the internet: how can I convert UTF-8 to ASCII (mostly accented characters to the same character without the accent) in C using only the standard library? I found solutions for most other languages, but not for C in particular.

Thanks!

EDIT: Some of the kind people who commented made me double-check what I needed, and I had exaggerated. I only need an idea of how to write a function that does: char with accent -> char without accent. :)

Take a look at libiconv. Even if you insist on doing it without libraries, you might find some inspiration there.

In general, you can't. UTF-8 covers much more than accented characters.

There's no built-in way of doing that. UTF-8 and ASCII are identical for code points 0 to 127; they differ only for characters above that range, which cannot be represented in ASCII anyway.

If you have a specific mapping you want (such as a with accent -> a), then you should probably just handle that as a string replace operation.
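A minimal sketch of that string-replace idea, using only the standard library. The table `accent_table`, the struct `accent_map`, and the function name `replace_accents` are made up for illustration, and the table covers only a handful of letters; anything non-ASCII that is not listed is emitted as a single `?`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mapping table: a UTF-8 byte sequence and its ASCII stand-in.
   Only a few entries are shown; extend it to whatever you need. */
struct accent_map { const char *utf8; char ascii; };

static const struct accent_map accent_table[] = {
    { "\xC3\xA0", 'a' }, /* U+00E0 a with grave */
    { "\xC3\xA1", 'a' }, /* U+00E1 a with acute */
    { "\xC3\xA2", 'a' }, /* U+00E2 a with circumflex */
    { "\xC3\xA7", 'c' }, /* U+00E7 c with cedilla */
    { "\xC3\xA8", 'e' }, /* U+00E8 e with grave */
    { "\xC3\xA9", 'e' }, /* U+00E9 e with acute */
    { "\xC3\xAA", 'e' }, /* U+00EA e with circumflex */
    { "\xC3\xAD", 'i' }, /* U+00ED i with acute */
    { "\xC3\xB1", 'n' }, /* U+00F1 n with tilde */
    { "\xC3\xB3", 'o' }, /* U+00F3 o with acute */
    { "\xC3\xBA", 'u' }, /* U+00FA u with acute */
    { "\xC3\xBC", 'u' }, /* U+00FC u with diaeresis */
};

/* Copies src to dest, replacing every sequence found in accent_table.
   Any other non-ASCII sequence becomes a single '?'.
   dest needs room for strlen(src) + 1 bytes (the output never grows). */
void replace_accents(char *dest, const char *src)
{
    const size_t n = sizeof accent_table / sizeof accent_table[0];

    while (*src) {
        size_t i;

        if ((unsigned char)*src < 0x80) {   /* plain ASCII: copy as-is */
            *dest++ = *src++;
            continue;
        }
        for (i = 0; i < n; i++) {
            size_t len = strlen(accent_table[i].utf8);
            if (strncmp(src, accent_table[i].utf8, len) == 0) {
                *dest++ = accent_table[i].ascii;
                src += len;
                break;
            }
        }
        if (i == n) {                       /* sequence not in the table */
            *dest++ = '?';
            src++;
            /* skip the continuation bytes (10xxxxxx) of this sequence */
            while (((unsigned char)*src & 0xC0) == 0x80)
                src++;
        }
    }
    *dest = '\0';
}
```

Calling `replace_accents(out, "caf\xC3\xA9")` should leave `"cafe"` in `out`. A linear scan is fine for a short table; for a large mapping, decoding the code point and indexing an array would likely be faster.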

Every decent Unicode support library (not the standard library, of course) has a way to decompose a string into NFD or NFKD form, which separates the diacritics from the base letters and gives you a shot at filtering them out. I'm not so sure this is worth pursuing: the result is just gibberish to a native reader of the language, and not every letter is decomposable. In other words, junk with question marks.
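To illustrate the filtering half of that approach, here is a rough sketch that drops the Combining Diacritical Marks block (U+0300 to U+036F) from a UTF-8 string. It assumes the input has already been decomposed by some other tool, since standard C has no normalizer, and the function name `strip_combining_marks` is invented for this example. Marks outside that one block are left alone.

```c
#include <stddef.h>

/* Removes code points U+0300..U+036F from a UTF-8 string that is ALREADY
   in decomposed form. Producing that form is assumed to happen elsewhere.
   Works in place and returns s. */
char *strip_combining_marks(char *s)
{
    unsigned char *in = (unsigned char *)s;
    unsigned char *out = (unsigned char *)s;

    while (*in) {
        /* U+0300..U+033F encode as CC 80 .. CC BF,
           U+0340..U+036F encode as CD 80 .. CD AF. */
        if ((in[0] == 0xCC && in[1] >= 0x80 && in[1] <= 0xBF) ||
            (in[0] == 0xCD && in[1] >= 0x80 && in[1] <= 0xAF)) {
            in += 2;            /* drop the combining mark */
        } else {
            *out++ = *in++;     /* keep every other byte */
        }
    }
    *out = '\0';
    return s;
}
```

With decomposed input such as `"cafe\xCC\x81"` (e followed by U+0301) the result should be `"cafe"`. A precomposed letter such as `"\xC3\xA9"` passes through unchanged, which is why the decomposition step matters.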

Since this is homework, I'm guessing your teacher is clueless and doesn't know anything about UTF-8, and probably is stuck in the 1980s with "code pages" and "extended ASCII" (words you should erase from your vocabulary if you haven't already). Your teacher probably wants you to write a 128-byte lookup table that maps CP437 or Windows-1252 bytes in the range 128-255 to similar-looking ASCII letters. It would go something like...

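void strip_accents(unsigned char *dest, const unsigned char *src)
{
    /* lut[i] holds the ASCII replacement for byte value 128 + i */
    static const unsigned char lut[128] = { /* mapping here */ };
    do {
        /* bytes below 128 are already ASCII; others index the table */
        *dest++ = *src < 128 ? *src : lut[*src - 128];
    } while (*src++);  /* the terminating NUL is copied too */
}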
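For what it's worth, here is one way the table might be filled in, assuming the input is single-byte ISO-8859-1 / Windows-1252 text (not UTF-8). The names `latin1_lut` and `strip_accents_latin1` are made up, only some letters are listed, and unlisted bytes come out as `?`. It relies on C99 designated initializers.

```c
/* Entry i describes input byte 128 + i. Entries not listed are 0,
   which the function below turns into '?'. */
static const unsigned char latin1_lut[128] = {
    [0xC0 - 128] = 'A', [0xC1 - 128] = 'A', [0xC7 - 128] = 'C',
    [0xC9 - 128] = 'E', [0xD1 - 128] = 'N',
    [0xE0 - 128] = 'a', [0xE1 - 128] = 'a', [0xE2 - 128] = 'a',
    [0xE4 - 128] = 'a', [0xE7 - 128] = 'c',
    [0xE8 - 128] = 'e', [0xE9 - 128] = 'e', [0xEA - 128] = 'e',
    [0xEB - 128] = 'e', [0xED - 128] = 'i', [0xEF - 128] = 'i',
    [0xF1 - 128] = 'n', [0xF3 - 128] = 'o', [0xF4 - 128] = 'o',
    [0xF6 - 128] = 'o', [0xFA - 128] = 'u', [0xFC - 128] = 'u',
};

/* Copies src to dest, mapping bytes 128..255 through latin1_lut.
   dest needs room for the same number of bytes as src, NUL included. */
void strip_accents_latin1(unsigned char *dest, const unsigned char *src)
{
    for (; *src; src++) {
        if (*src < 128) {
            *dest++ = *src;                      /* already ASCII */
        } else {
            unsigned char r = latin1_lut[*src - 128];
            *dest++ = r ? r : '?';               /* unmapped byte */
        }
    }
    *dest = '\0';
}
```

Feeding it the Latin-1 bytes `"caf\xE9"` should give `"cafe"`. The same table would produce nonsense on UTF-8 input, because each accented letter there is two bytes rather than one.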
