UTF-8 -> ASCII in C language

I have a simple question that I can't find an answer to anywhere on the internet: how can I convert UTF-8 to ASCII (mostly accented characters to the same character without the accent) in C using only the standard library? I found solutions for most other languages, but not for C in particular.

Thanks!

EDIT: Some of the kind people who commented made me double-check what I needed, and it turns out I exaggerated. I only need an idea of how to write a function that does: char with accent -> char without accent. :)

Take a look at libiconv. Even if you insist on doing it without libraries, you might find inspiration there.

In general, you can't. UTF-8 covers much more than accented characters.

There's no built-in way of doing that. UTF-8 and ASCII are byte-for-byte identical for characters below 128; they only differ for characters above that range, which cannot be represented in ASCII anyway.

If you have a specific mapping you want (such as a with accent -> a), then you should probably just handle it as a string replace operation.
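For the common Western European case, such a replace operation could look roughly like the sketch below, using only standard C. Treat it as an illustration under assumptions rather than a complete answer: the names utf8_strip_accents and latin1_to_ascii are made up here, the input is assumed to be valid UTF-8, only U+00C0..U+00FF is mapped, and anything else outside ASCII becomes '?'. The choice of look-alike letters in the table is a judgment call as well.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: map a code point in U+00C0..U+00FF to an ASCII
   look-alike, or '?' when there is no obvious single-letter match. */
static char latin1_to_ascii(unsigned cp)
{
    /* One entry per code point, starting at U+00C0. */
    static const char table[] =
        "AAAAAA?CEEEEIIII"   /* U+00C0..U+00CF */
        "DNOOOOO?OUUUUY??"   /* U+00D0..U+00DF */
        "aaaaaa?ceeeeiiii"   /* U+00E0..U+00EF */
        "dnooooo?ouuuuy?y";  /* U+00F0..U+00FF */
    return (cp >= 0xC0 && cp <= 0xFF) ? table[cp - 0xC0] : '?';
}

/* Copy the UTF-8 string src to dest as plain ASCII.
   dest must have room for strlen(src) + 1 bytes. */
void utf8_strip_accents(char *dest, const char *src)
{
    const unsigned char *s;

    assert(dest != NULL && src != NULL);
    s = (const unsigned char *)src;

    while (*s) {
        if (*s < 0x80) {
            /* Plain ASCII: copy unchanged. */
            *dest++ = (char)*s++;
        } else if ((*s == 0xC2 || *s == 0xC3) && (s[1] & 0xC0) == 0x80) {
            /* Two-byte sequence encoding U+0080..U+00FF. */
            unsigned cp = ((*s & 0x1Fu) << 6) | (s[1] & 0x3Fu);
            *dest++ = latin1_to_ascii(cp);
            s += 2;
        } else {
            /* Anything else: emit one '?' and skip the whole sequence. */
            *dest++ = '?';
            s++;
            while ((*s & 0xC0) == 0x80)
                s++;
        }
    }
    *dest = '\0';
}
```

The output is never longer than the input, so a destination buffer of strlen(src) + 1 bytes is always enough.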

Every decent Unicode support library (not the standard library, of course) has a way to decompose a string into normalization form D or KD (NFD or NFKD), which separates the diacritics from the letters and gives you a shot at filtering them out. I'm not so sure this is worth pursuing: the result is just gibberish to a native-language reader, and not every letter is decomposable. In other words, junk with question marks.
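The filtering step on its own is small enough to sketch in standard C. The sketch assumes the string has already been decomposed (NFD or NFKD) by some Unicode library, which is not shown, and that it is valid UTF-8. It only drops the Combining Diacritical Marks block, U+0300..U+036F, which UTF-8 encodes as the byte pairs CC 80 through CD AF. The function name strip_combining_marks is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Remove combining marks U+0300..U+036F from a decomposed UTF-8 string,
   in place. Returns the new length in bytes. */
size_t strip_combining_marks(char *s)
{
    unsigned char *in, *out;

    assert(s != NULL);
    in = out = (unsigned char *)s;

    while (*in) {
        /* U+0300..U+033F is CC 80..CC BF; U+0340..U+036F is CD 80..CD AF. */
        int is_mark =
            (in[0] == 0xCC && in[1] >= 0x80 && in[1] <= 0xBF) ||
            (in[0] == 0xCD && in[1] >= 0x80 && in[1] <= 0xAF);

        if (is_mark)
            in += 2;          /* drop the two-byte mark */
        else
            *out++ = *in++;   /* keep every other byte as-is */
    }
    *out = '\0';
    return (size_t)(out - (unsigned char *)s);
}
```

Letters that do not decompose (ø or ß, for example) pass through untouched, so the result is still not guaranteed to be ASCII.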

Since this is homework, I'm guessing your teacher is clueless and doesn't know anything about UTF-8, and is probably stuck in the 1980s with "code pages" and "extended ASCII" (words you should erase from your vocabulary if you haven't already). Your teacher probably wants you to write a 128-byte lookup table that maps CP437 or Windows-1252 bytes in the range 128-255 to similar-looking ASCII letters. It would go something like...

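/* Copy src to dest, replacing each byte >= 128 with an ASCII look-alike.
   dest must have room for strlen(src) + 1 bytes. */
void strip_accents(unsigned char *dest, const unsigned char *src)
{
    /* lut[i] holds the replacement for byte value 128 + i */
    static const unsigned char lut[128] = { /* mapping here */ };
    do {
        *dest++ = *src < 128 ? *src : lut[*src - 128];
    } while (*src++);  /* the terminating 0 is copied before the loop ends */
}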
