
Looking for UTF8-aware formatting functions like printf(), etc

I discovered an interesting problem when processing UTF-8 strings containing non-ASCII chars with C standard library formatting functions like sprintf():

The functions of the printf() family are not aware of UTF-8 and process everything based on the number of bytes, not characters. Therefore the formatting is incorrect.

Simple example:

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    const char* testMsg = "Tääääßt";
    char buf[1024];
    int len;

    /* width and precision of %s are counted in bytes, not characters */
    sprintf(buf, "|%7.7s|", testMsg);
    len = (int)strlen(buf);
    printf("Result=\"%s\", len=%d", buf, len);

    return 0;
}

The result is:

 Result="|Täää|", len=7

Most probably some of you will recommend converting the application from char to wchar_t and using fwprintf(), etc., but that's absolutely impossible because of huge existing applications. I could imagine writing a wrapper that uses these functions internally, but this would be tricky and very inefficient.

So the best solution would be a UTF-8-aware replacement for the formatting functions of the Standard C Library.

Currently I'm working on QNX 6.4, but replies for other operating systems, e.g. Linux, are also very welcome.
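One way to get character-based width and precision while staying with the char-based printf() family is to translate the requested character counts into byte counts and pass them through `%*.*s`. The following is only a sketch under stated assumptions: the input is valid UTF-8, and "character" means code point, so combining marks and double-width characters are not handled. The names `u8_prefix_bytes` and `u8_field` are hypothetical and not part of any library.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: number of bytes occupied by the first max_cp code
   points of s (or by all of s if it is shorter). Assumes valid UTF-8.
   *cp_out receives the number of code points actually covered. */
static size_t u8_prefix_bytes(const char *s, size_t max_cp, size_t *cp_out)
{
    size_t bytes = 0, cp = 0;
    while (s[bytes] != '\0') {
        /* A byte that is not a continuation byte (10xxxxxx) starts a code point. */
        if (((unsigned char)s[bytes] & 0xC0) != 0x80) {
            if (cp == max_cp)
                break;
            cp++;
        }
        bytes++;
    }
    *cp_out = cp;
    return bytes;
}

/* Hypothetical helper: format s right-aligned in a field of `width` code
   points, truncated to at most `prec` code points, like "%<width>.<prec>s". */
static int u8_field(char *out, size_t out_size, int width, int prec, const char *s)
{
    size_t cp;
    size_t bytes = u8_prefix_bytes(s, (size_t)prec, &cp);
    /* printf pads by bytes, so widen the field by the number of bytes that
       do not start a code point. */
    int byte_width = width + (int)(bytes - cp);
    return snprintf(out, out_size, "%*.*s", byte_width, (int)bytes, s);
}
```

Under those assumptions, `u8_field(buf, sizeof buf, 7, 7, testMsg)` keeps all seven characters of the test string instead of cutting it after seven bytes. The return value is still a byte count, as with snprintf().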

Well, once you ask printf to do intelligent padding of Unicode characters, you run into major problems. As they say,

w͢͢͝h͡o͢͡ ̸͢k̵͟n̴͘ǫw̸̛s͘ ̀́w͘͢ḩ̵a҉̡͢t ̧̕h́o̵r͏̵rors̡ ̶͡͠lį̶e͟͟ ̶͝in͢ ͏t̕h̷̡͟e ͟͟d̛a͜r̕͡k̢̨ ͡h̴e͏a̷̢̡rt́͏ ̴̷͠ò̵̶f̸ u̧͘ní̛͜c͢͏o̷͏d̸͢e̡͝?͞

  • How many Unicode characters are in Tääääßt? Well, it could be anywhere from 7 to 11, depending on how it's encoded. Each ä can be written as U+00E4, which is one character, or it could be written as U+0061 U+0308, which is two characters. So your next hope is to count grapheme clusters. (No, normalization won't make the problem go away.)

  • But, how wide is a grapheme cluster? Obviously, a is one column wide. U+200B should be zero columns wide, it's a "zero-width" space. Should each ひらがな be two columns wide? They usually are in terminal emulators. What happens when you format ひらがな as 7 columns, do you get "ひらが ", which adds a space, or do you get "ひらが", which is only 6 columns?

  • If you cut something up which mixes RTL and LTR text, should you reset the text direction afterwards? What are you going to do? (Some terminal emulators, such as Apple's, support a mixture of left-to-right and right-to-left text.)

  • What is your goal in truncating text? Are you trying to show the user a string in limited space, or are you trying to write a format that uses fixed-width fields?
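The first point can be made concrete by counting code points for both encodings of ä. This is a rough sketch that assumes valid UTF-8 input; `u8_count` is a hypothetical helper, and skipping only the combining diacritical marks block (U+0300 to U+036F) is a crude stand-in for grapheme cluster segmentation, not an implementation of it.

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a valid UTF-8 string. If
   skip_combining is nonzero, code points in U+0300..U+036F (combining
   diacritical marks) are not counted. */
static size_t u8_count(const char *s, int skip_combining)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;
    while (*p != '\0') {
        unsigned long cp;
        int extra;                       /* number of continuation bytes */
        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        p++;
        /* Fold in the continuation bytes (10xxxxxx), six bits each. */
        while (extra-- > 0 && (*p & 0xC0) == 0x80)
            cp = (cp << 6) | (*p++ & 0x3F);
        if (!skip_combining || cp < 0x0300 || cp > 0x036F)
            count++;
    }
    return count;
}

/* "Tääääßt" written with precomposed ä (U+00E4) and with a + U+0308.
   The literals are split so that 'a' is not read as part of a hex escape. */
static const char precomposed[] = "T\xC3\xA4\xC3\xA4\xC3\xA4\xC3\xA4\xC3\x9Ft";
static const char decomposed[]  = "Ta\xCC\x88" "a\xCC\x88" "a\xCC\x88" "a\xCC\x88" "\xC3\x9Ft";
```

Here `u8_count(precomposed, 0)` is 7 and `u8_count(decomposed, 0)` is 11, the two ends of the range mentioned above, while `u8_count(decomposed, 1)` is 7 again. Column width, zero-width characters and bidirectional text are not addressed by this at all.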

Basically, if you want to cut Unicode text into chunks, you shouldn't be doing it with something as simple as printf (or wprintf, which is quite possibly worse). Use LibICU to iterate over the breaks you want. Writing a UTF-8 aware version of printf is asking for all sorts of trouble that you don't want.

The following C99 code snippet defines the function u8printf, in which format specifiers such as %10s yield 10 UTF-8 code points, that is, characters rather than bytes. Don't forget to set the locale with setlocale(LC_ALL,"") somewhere before this routine is called. This works because wprintf uses wchar_t internally. You can define u8fprintf and u8sprintf in a similar way. If you want to write this without C99 variable-length arrays, then a suitable combination of malloc/free is also possible.

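#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* printf() variant in which widths and precisions count characters, not bytes.
   Call setlocale(LC_ALL,"") beforehand so the conversions use the UTF-8 locale. */
int u8printf(const char *fmt,...){
    size_t n=mbstowcs(NULL,fmt,0);          /* length of fmt in wide characters */
    if(n==(size_t)-1) return -1;            /* fmt is not valid in this locale */
    wchar_t wfmt[n+1];
    mbstowcs(wfmt,fmt,n+1);
    for(int m=128;m<=32768;m*=2){           /* retry with a larger buffer until it fits */
        wchar_t wbuf[m];
        va_list ap;
        va_start(ap,fmt);                   /* a va_list must not be reused after vswprintf */
        int r=vswprintf(wbuf,m,wfmt,ap);
        va_end(ap);
        if(r>=0) {
            char buf[m*4];                  /* at most 4 UTF-8 bytes per character */
            if(wcstombs(buf,wbuf,m*4)==(size_t)-1) return -1;
            fputs(buf,stdout);
            return r;
        }
    }
    return -1;
}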
