Printing every character of a UTF8 string

I am new to Unicode and UTF-8 string representations. I am trying to read a UTF-8 encoded file, split it on spaces, and then print every character (code point) of every word.

I was able to use wchar_t (I know it uses UTF-16 or UTF-32 (?) internally) to read text from the file, print it, and write it to another file. However, I was unable to use wchar_t to get a substring or to traverse the string element by element.

To solve this, I used the ICU library from IBM. Code:

while (fgetws(readString, 1000, wifile) != NULL) {
        wprintf(L"String: %s\n", readString);
        //split string on the base of spaces.
        wchar_t *nextToken = NULL;
        wchar_t *token = wcstok_s(readString, L" ", &nextToken);
        UChar *utf8Token = (UChar *)token;
        u_printf("Token in UChar: %S\n", utf8Token);
        while (token != NULL) {
            printf("Hello.\n");
            fwprintf(wofileString, L"%ls and length: %d\n", token, wcslen(token));
            fwprintf(wofileString, L"UTF8 rep:%s and length: %d\n", utf8Token, u_strlen(utf8Token));
            int32_t counter = 0;
            for (counter = 0; counter < u_strlen(utf8Token);) {
                UChar32 ch;
                U8_NEXT(utf8Token, counter, u_strlen(utf8Token), ch);
                fwprintf(wofileString, L"Token[%d] = ", counter);
                if (ch < 127) {
                    printf("Less than 127.\n");
                    if (ch > 1) {
                        printf("Printing%d.\n", ch);
                        u_fprintf((UFILE *)wofileString, "%c\n", (UChar)ch);
                    }
                } else if (ch == CharacterIterator::DONE) {
                    printf("Done.\n");
                    u_fprintf((UFILE *)wofileString, "[CharacterIterator::DONE]\n");
                } else {
                    printf("More than 127.\n");
                    u_fprintf((UFILE *)wofileString, "[%X]\n", ch);
                }
            }
            token = wcstok_s(NULL, L" ", &nextToken);
            utf8Token = (UChar *)token;
            counter = 0;
        }
        fputws(L"Complete String: ", wofileString);
        fputws(readString, wofileString);
        fputws(L"\n", wofileString);
    }

This program always stops working when it gets to the part where the characters are printed.

My questions:
1. How can I print all the 'characters' in the input UTF8 string?
2. Is the conversion UChar *utf8Token = (UChar *)token; even correct, given that the internal representation of token is UTF-16 or UTF-32?
3. Where am I going wrong?
4. How do I get a substring of the string?

fwprintf(wofileString,…
u_fprintf((UFILE *)wofileString,…

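One of these two lines is wrong, depending on what wofileString actually is. fwprintf expects a FILE *, while u_fprintf expects ICU's own UFILE *, and a cast does not turn one into the other.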

I'd recommend just using the u_… functions.

In fact, I'd just use u_printf("string", …) or u_printf_u(L"String", …) instead of fwprintf or fputws.
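
As for questions 1 and 4, here is a minimal sketch of walking a UTF-8 string code point by code point and taking a substring by code-point index. It deliberately uses only the C++ standard library rather than ICU, so it is not the ICU way of doing it. The names next_code_point, code_points, describe and utf8_substr are made up for this example. It assumes the text is kept as raw UTF-8 bytes in a std::string (not converted to wchar_t), and it only does basic validation: malformed or truncated sequences become U+FFFD, but overlong encodings and surrogates are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Decode the code point that starts at byte offset i and advance i past it.
// On a malformed or truncated sequence, skip one byte and return U+FFFD.
char32_t next_code_point(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t extra = 0;  // number of continuation bytes expected
    char32_t cp = 0;
    if (b0 < 0x80)                { extra = 0; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else { ++i; return 0xFFFD; }  // stray continuation or invalid lead byte

    if (s.size() - i <= extra) { ++i; return 0xFFFD; }  // truncated sequence
    for (std::size_t k = 1; k <= extra; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) { ++i; return 0xFFFD; }  // not a continuation
        cp = (cp << 6) | (b & 0x3F);
    }
    i += extra + 1;
    return cp;
}

// Collect all code points of a UTF-8 string.
std::vector<char32_t> code_points(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) out.push_back(next_code_point(s, i));
    return out;
}

// Render the code points of a word as "U+XXXX U+XXXX ...".
std::string describe(const std::string& word) {
    std::string out;
    for (char32_t cp : code_points(word)) {
        char buf[16];
        std::snprintf(buf, sizeof buf, "U+%04X", static_cast<unsigned>(cp));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

// Substring of `count` code points starting at code point index `start`.
// Indices count code points, not bytes; the result is still UTF-8.
std::string utf8_substr(const std::string& s, std::size_t start, std::size_t count) {
    std::size_t i = 0;
    for (std::size_t n = 0; n < start && i < s.size(); ++n) next_code_point(s, i);
    const std::size_t begin = i;
    for (std::size_t n = 0; n < count && i < s.size(); ++n) next_code_point(s, i);
    return s.substr(begin, i - begin);
}
```

To use it on a file, you could read each line as bytes with std::getline, split it on spaces (for instance with std::istringstream), and call describe on each word. Note that a code point is not always what a user perceives as one character; combining sequences need grapheme segmentation, which is where a library such as ICU earns its keep.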
