
Looking for UTF-8-aware formatting functions like printf(), etc.

I discovered an interesting problem when processing UTF-8 strings containing non-ASCII characters with C standard library formatting functions like sprintf():

The functions of the printf() family are not aware of UTF-8 and apply field width and precision in bytes, not characters. As a result, padding and truncation come out wrong for any string with multibyte characters.

Simple example:

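#include <stdio.h>
#include <string.h>   /* strlen */

int main(int argc, char *argv[])
{
    const char* testMsg = "Tääääßt";   /* 7 characters, 12 bytes in UTF-8 */
    char buf[1024];
    int len;

    /* width and precision of %s are counted in bytes, not characters */
    sprintf(buf, "|%7.7s|", testMsg);
    len = (int)strlen(buf);
    printf("Result=\"%s\", len=%d\n", buf, len);

    return 0;
}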

The result is:

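 Result="|Täää|", len=9

The field between the bars is 7 bytes long but holds only 4 characters; len also counts the two bars.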

Most probably some of you will recommend converting the application from char to wchar_t and using fwprintf(), etc., but that's absolutely impossible because of the huge existing applications. I could imagine writing a wrapper that uses these functions internally, but this would be tricky and very inefficient.

So the best solution would be a UTF-8-aware replacement for the formatting functions of the Standard C Library.

Currently I'm working on QNX 6.4, but replies for other operating systems, e.g. Linux, are also very welcome.

Well, once you ask printf to do intelligent padding of Unicode characters, you run into major problems. As they say,

w͢͢͝h͡o͢͡ ̸͢k̵͟n̴͘ǫw̸̛s͘ ̀́w͘͢ḩ̵a҉̡͢t ̧̕h́o̵r͏̵rors̡ ̶͡͠lį̶e͟͟ ̶͝in͢ ͏t̕h̷̡͟e ͟͟d̛a͜r̕͡k̢̨ ͡h̴e͏a̷̢̡rt́͏ ̴̷͠ò̵̶f̸ u̧͘ní̛͜c͢͏o̷͏d̸͢e̡͝?͞

  • How many Unicode characters are in Tääääßt? Well, it could be anywhere from 7 to 11, depending on how it's encoded. Each ä can be written as U+00E4, which is one character, or as U+0061 U+0308, which is two characters. So your next hope is to count grapheme clusters. (No, normalization won't make the problem go away.)

  • But how wide is a grapheme cluster? Obviously, a is one column wide. U+200B should be zero columns wide; it's a "zero-width" space. Should each character of ひらがな be two columns wide? They usually are in terminal emulators. What happens when you format ひらがな as 7 columns: do you get "ひらが ", which adds a space, or do you get "ひらが", which is only 6 columns?

  • If you cut something up which mixes RTL and LTR text, should you reset the text direction afterwards? What are you going to do? (Some terminal emulators, such as Apple's, support a mixture of left-to-right and right-to-left text.)

  • What is your goal in truncating text? Are you trying to show the user a string in limited space, or are you trying to write a format that uses fixed-width fields?

Basically, if you want to cut Unicode text into chunks, you shouldn't be doing it with something as simple as printf (or wprintf, which is quite possibly worse). Use LibICU to iterate over the breaks you want. Writing a UTF-8 aware version of printf is asking for all sorts of trouble that you don't want.

The following C99 code snippet defines the function u8printf, where format specifiers such as %10s yield 10 UTF-8 code points, that is, characters rather than bytes. (These are code points, not grapheme clusters or display columns.) Don't forget to set the locale with setlocale(LC_ALL,"") somewhere before this routine is called, and make sure that locale uses UTF-8. This works because the wprintf family counts field width and precision in wide characters (wchar_t) instead of bytes. You can define u8fprintf and u8sprintf in a similar way. If you want to write this without C99 variable length arrays, then a suitable combination of malloc/free is also possible.

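#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int u8printf(const char *fmt, ...)
{
    /* convert the multibyte (UTF-8) format string to wide characters */
    size_t n = mbstowcs(NULL, fmt, 0);
    if (n == (size_t)-1) return -1;
    wchar_t wfmt[n + 1];
    mbstowcs(wfmt, fmt, n + 1);

    /* vswprintf cannot report the size it needs, so retry with growing buffers */
    for (size_t m = 128; m <= 32768; m *= 2) {
        wchar_t wbuf[m];
        va_list ap;
        va_start(ap, fmt);          /* the argument list must be restarted for every attempt */
        int r = vswprintf(wbuf, m, wfmt, ap);
        va_end(ap);
        if (r >= 0) {
            /* convert back; each wide character needs at most MB_CUR_MAX bytes */
            char buf[(size_t)r * MB_CUR_MAX + 1];
            if (wcstombs(buf, wbuf, sizeof buf) == (size_t)-1) return -1;
            fputs(buf, stdout);
            return r;               /* number of wide characters written */
        }
    }
    return -1;
}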
