
How to convert wchar_t to a multibyte char in C

I'm looking for a way to convert wchar_t to a multibyte char, without using wctomb or any ready-made routine. I have to do that in C, not C++, and portability doesn't matter here.

My goal is to print a wchar_t byte by byte using the write syscall. For example, the character 'é' is 0xe9 when stored in a wchar_t, and shows up as ff ff ff c3 ff ff ff a9 in its multibyte form. How can I convert from one form to the other?

Thanks in advance.

I'm looking for a way to convert wchar_t to a multibyte char, without using wctomb or any ready-made routine

This is the same as converting between any two encodings. First determine the encoding used by the source and the one used by the destination, then translate characters from one to the other.

Start with wchar_t: its encoding is (or should be) fixed, and it is determined by your compiler and environment, so read the documentation for both. You specified Debian with gcc, so read the gcc documentation; on Linux today, wchar_t is meant to hold one UCS-4 "character". Note that on Windows wchar_t is 16 bits wide and holds UTF-16 code units.
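One way to check that assumption about wchar_t in code is sketched below. It relies on the standard macro __STDC_ISO_10646__, which an implementation defines when wchar_t values are ISO 10646 (Unicode) code points. The function names wchar_holds_code_points and code_point_of are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>

/* Returns 1 when the implementation states that wchar_t values are
 * ISO 10646 code points and wchar_t is wide enough for all of them. */
int wchar_holds_code_points(void)
{
#ifdef __STDC_ISO_10646__
    return WCHAR_MAX >= 0x10FFFF;
#else
    return 0;
#endif
}

/* Reinterpret a wide character as a UCS-4 code point.
 * Only meaningful when wchar_holds_code_points() is true. */
uint32_t code_point_of(wchar_t wc)
{
    assert(wchar_holds_code_points());
    return (uint32_t)wc;
}
```

If the check fails (for example with a 16-bit wchar_t), a single wchar_t may be only half of a surrogate pair, and the simple cast is not enough.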

Then determine the destination encoding, that is, the encoding of the multibyte string. It depends on the locale. Read and parse the LC_CTYPE locale; you might want to read about POSIX locales and about locale naming. Then, because of "without using any ready-made routine", in the sad case when the locale name doesn't specify a codeset, you have to write your own platform-specific parser for the locale files and infer the default character encoding for the current locale (I am not really sure how that works; you have to find "the locale language category"). The pages man 7 locale and man 7 charsets look like a good read.
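The easy half of that step, reading the codeset out of the locale name, could look roughly like the sketch below. It assumes the POSIX precedence LC_ALL, then LC_CTYPE, then LANG, and the usual name shape language_territory.codeset@modifier. It does not handle locale aliases or names without a codeset, and the function names ctype_locale_name and locale_name_is_utf8 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Name of the locale that governs LC_CTYPE, using the precedence
 * LC_ALL > LC_CTYPE > LANG. Falls back to "C" when none is set. */
const char *ctype_locale_name(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    for (size_t i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *v = getenv(vars[i]);
        if (v != NULL && v[0] != '\0')
            return v;
    }
    return "C";
}

/* Returns 1 if the codeset part of a name such as "fr_FR.UTF-8@euro"
 * spells UTF-8 (ignoring case and dashes), otherwise 0. */
int locale_name_is_utf8(const char *name)
{
    static const char want[] = "utf8";
    size_t k = 0;

    assert(name != NULL);
    while (*name != '\0' && *name != '.')   /* skip language_territory */
        name++;
    if (*name == '\0')
        return 0;                           /* no codeset in the name */
    name++;                                 /* step over the '.' */
    for (; *name != '\0' && *name != '@'; name++) {
        char c = *name;
        if (c == '-')
            continue;
        if (c >= 'A' && c <= 'Z')
            c = (char)(c - 'A' + 'a');      /* ASCII lower-casing */
        if (k >= 4 || c != want[k])
            return 0;
        k++;
    }
    return k == 4;
}
```

A call such as locale_name_is_utf8(ctype_locale_name()) then tells you whether the simple UTF-8 path applies. A return value of 0 only means the name does not say UTF-8, not that the encoding is known.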

After determining the source and destination encodings, you need to write a routine that translates one into the other. Because of "without using any ready-made routine" you don't want to use iconv, which means you have to write it yourself. That comes down to reading the specifications of both encodings, learning which characters are represented by which code points in each, and then deciding how to translate each and every code point from one encoding to the other.

All in all, the source code of other projects, such as glibc, libiconv or libunistring, might be a source of inspiration.

It's for a school project, so I guess it's not that hard once you know the trick.

Most probably the multibyte encoding is UTF-8; Unicode dominates today's world. As such, you'll want to research how to convert UTF-32 to UTF-8, which is actually a simple routine.
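A minimal sketch of that routine follows, assuming wchar_t holds a UCS-4 code point and the destination is UTF-8. The names utf8_encode and put_wchar_utf8 are made up for this example; only write is a real system call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#include <wchar.h>

/* Encode one code point as UTF-8 into out[0..3].
 * Returns the number of bytes stored (1 to 4), or 0 if cp is a
 * surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Write one wide character to fd as UTF-8 using the write syscall.
 * Returns 0 on success, -1 on an invalid code point or a write error. */
int put_wchar_utf8(int fd, wchar_t wc)
{
    unsigned char buf[4];
    size_t len = utf8_encode((uint32_t)wc, buf);
    size_t done = 0;

    if (len == 0)
        return -1;
    while (done < len) {                    /* write may be partial */
        ssize_t n = write(fd, buf + done, len - done);
        if (n <= 0)
            return -1;
        done += (size_t)n;
    }
    return 0;
}
```

For 'é' (0xe9) the encoder produces the two bytes c3 a9. The ff ff ff in front of each byte in the question is presumably sign extension from printing a signed char with %x; storing the bytes in an unsigned char avoids it.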
