
In C, how to print UTF-8 char if given its bytes in char variables?

If I have c1, c2 as char variables (such that c1c2 would be the byte sequences for the UTF-8 character), how do I create and print the UTF-8 character?

Similarly for the 3 and 4 byte UTF-8 characters?

I've been trying all kinds of approaches with mbstowcs() but I just can't get it to work.

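I managed to write a working example.
When c1 is '\xce' and c2 is '\xb8', the result is θ.
It turns out that I have to call setlocale before using mbstowcs: a C program starts in the "C" locale, and mbstowcs decodes its input according to the current locale, so a UTF-8 locale has to be selected first.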

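#include <stdlib.h>
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
   // mbstowcs() decodes according to the current locale, so select a UTF-8 one first.
   const char *localeInfo = setlocale(LC_ALL, "en_US.utf8");
   if (localeInfo == NULL) {
       fprintf(stderr, "Could not set locale en_US.utf8\n");
       return 1;
   }
   printf("Locale information set to %s\n", localeInfo);

   const char c1 = '\xce';
   const char c2 = '\xb8';
   const size_t byteCount = 2;

   char *mbS = malloc(byteCount + 1);
   if (mbS == NULL) {
       return 1;
   }
   mbS[0] = c1;
   mbS[1] = c2;
   mbS[byteCount] = '\0'; // null terminator
   printf("Directly using printf: %s\n", mbS);

   // With a NULL destination, mbstowcs() only counts the wide characters needed.
   size_t requiredSize = mbstowcs(NULL, mbS, 0);
   if (requiredSize == (size_t)-1) {
       printf("Failed conversion!\n");
       free(mbS);
       return 1;
   }
   printf("Output size including null terminator is %zu\n\n", requiredSize + 1);

   wchar_t *wideOutput = malloc((requiredSize + 1) * sizeof(wchar_t));
   if (wideOutput == NULL) {
       free(mbS);
       return 1;
   }

   size_t len = mbstowcs(wideOutput, mbS, requiredSize + 1);
   if (len == (size_t)-1) {
       printf("Failed conversion!\n");
   } else {
       printf("Converted %zu character(s). Result: %ls\n", len, wideOutput);
   }

   free(wideOutput);
   free(mbS);
   return 0;
}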

Output:

Locale information set to en_US.utf8
Directly using printf: θ
Output size including null terminator is 2

Converted 1 character(s). Result: θ

For 3- or 4-byte UTF-8 characters, the same approach works: set byteCount to 3 or 4 and store the additional bytes in mbS before the null terminator.
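
As a rough illustration of the multi-byte case, the sketch below assembles two to four bytes into a null-terminated string that could then be passed to printf("%s") or mbstowcs() as in the example above. The helper names utf8_expected_len and utf8_pack are hypothetical, not library functions, and the check is deliberately shallow: it only compares the lead byte against the byte count and verifies that the remaining bytes look like continuation bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: how many bytes a UTF-8 sequence should have, judged
 * from its lead byte alone. Returns 0 if the byte cannot start a sequence. */
size_t utf8_expected_len(unsigned char lead)
{
    if (lead < 0x80) return 1;                   /* 0xxxxxxx: ASCII          */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  /* 110xxxxx: 2-byte lead    */
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  /* 1110xxxx: 3-byte lead    */
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  /* 11110xxx: 4-byte lead    */
    return 0;                                    /* continuation or invalid  */
}

/* Hypothetical helper: copies n bytes into dst and null-terminates the result.
 * dst must have room for n + 1 bytes. Returns n on success, or 0 if n does not
 * match the lead byte or a trailing byte is not of the form 10xxxxxx.
 * This is not a full validator (overlong forms and surrogates are not rejected). */
size_t utf8_pack(char *dst, const char *bytes, size_t n)
{
    if (n == 0 || utf8_expected_len((unsigned char)bytes[0]) != n) {
        return 0;
    }
    for (size_t i = 1; i < n; i++) {
        if (((unsigned char)bytes[i] & 0xC0) != 0x80) {
            return 0;
        }
    }
    memcpy(dst, bytes, n);
    dst[n] = '\0';
    return n;
}
```

With a char buffer of five bytes, calling utf8_pack with the three bytes '\xe2', '\x82', '\xac' should produce a string holding the euro sign, and the four bytes '\xf0', '\x9f', '\x98', '\x80' a string holding a single emoji.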

If I have c1, c2 as char variables (such that c1c2 would be the byte sequences for the UTF-8 character), how do I create and print the UTF-8 character?

Together they already are a UTF-8 character. You would just print them.

putchar(c1);
putchar(c2);

It's up to your terminal, or whatever device displays the output, to understand and render the UTF-8 encoding properly. This is unrelated to the encoding used by your program and unrelated to wide characters.

Similarly for the 3 and 4 byte UTF-8 characters?

You would output them.
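
A minimal sketch of what that could look like for one to four bytes, assuming the bytes are held in separate char variables as in the question. The function name put_utf8 is hypothetical; it only forwards the raw bytes to the stream and leaves rendering to whatever reads them.

```c
#include <stdio.h>

/* Hypothetical helper: writes the first `count` of the bytes c1..c4 to `out`
 * unchanged. Returns `count` on success, or -1 if `count` is out of range or
 * the write comes up short. No decoding or validation takes place. */
int put_utf8(FILE *out, int count, char c1, char c2, char c3, char c4)
{
    const char bytes[4] = { c1, c2, c3, c4 };

    if (count < 1 || count > 4) {
        return -1;
    }
    if (fwrite(bytes, 1, (size_t)count, out) != (size_t)count) {
        return -1;
    }
    return count;
}
```

For a 3-byte character the fourth argument is simply ignored, for example put_utf8(stdout, 3, '\xe2', '\x82', '\xac', 0).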


If your terminal or the device you are sending the bytes to does not understand UTF-8 encoding, then you have to convert the bytes to something the device understands. Typically, you would use an external library for that, like iconv. Alternatively, you could call setlocale(LC_ALL, "C.utf-8"), convert your bytes to wchar_t, then call setlocale(LC_ALL, "C.your_target_encoding") and either convert the wide string to that encoding or output it with %ls. All %ls does (on common systems) is convert the wide string back to a multibyte string in the current locale and then output it. A wide stream writing to a terminal does the same: it converts first, then outputs.
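
The iconv route might look roughly like the sketch below. It assumes a POSIX-style iconv() as provided by glibc (some platforms need linking with -liconv), and the helper name utf8_to_charset is hypothetical. Which charset names are accepted depends on the iconv implementation, so the target name is left to the caller.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts n UTF-8 bytes from `in` to `to_charset`,
 * writing at most out_cap - 1 bytes plus a terminating '\0' into `out`.
 * Returns the number of bytes written, or (size_t)-1 on any failure
 * (unknown charset, unrepresentable character, output buffer too small). */
size_t utf8_to_charset(const char *to_charset, const char *in, size_t n,
                       char *out, size_t out_cap)
{
    if (out_cap == 0) {
        return (size_t)-1;
    }

    iconv_t cd = iconv_open(to_charset, "UTF-8");
    if (cd == (iconv_t)-1) {
        return (size_t)-1;
    }

    /* iconv() takes char ** for its input but does not modify the bytes. */
    char *src = (char *)in;
    size_t src_left = n;
    char *dst = out;
    size_t dst_left = out_cap - 1;   /* keep one byte for the terminator */

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        return (size_t)-1;
    }

    *dst = '\0';
    return (size_t)(dst - out);
}
```

Where the implementation knows the charset name "ISO-8859-7", converting the two bytes '\xce', '\xb8' this way should yield the single byte 0xE8, which is θ in that Greek encoding.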

