How to count the number of multibyte characters?

I'd like to get 5 instead of 10 for the following program. Does anybody know how to fix the code to count the number of multibyte characters? Thanks.

/* vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8: */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <locale.h>

size_t nchars(const char *s) {   
    size_t charlen, chars;
    mbstate_t mbs;

    chars = 0;
    memset(&mbs, 0, sizeof(mbs));
    while (
            (charlen = mbrlen(s, MB_CUR_MAX, &mbs)) != 0
            && charlen != (size_t)-1
            && charlen != (size_t)-2
            ) {
        s += charlen;
        chars++;
    }   

    return (chars);
}   

int main() {
    setlocale(LC_CTYPE, "en_US.utf8");
    char * text = "öçşğü";

    printf("%zu\n", nchars (text));

    return 0;
}

$ ./main.exe
10

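Secondary problem: you should initialize an object of type mbstate_t by value, for example with a zero initializer (mbstate_t mbs = {0};), not with memset. The standard says that a zero-valued mbstate_t describes an initial conversion state, but an all-bytes-zero mbstate_t is not strictly guaranteed to represent an initial shift state, nor even any valid shift state. The mbsinit function does not initialize anything; it only reports whether a given state object describes the initial conversion state.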
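As a minimal sketch of resetting the conversion state by value rather than by memset, assuming a C99 or later compiler. The helper name reset_state is hypothetical and is not part of the original program:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: put *ps into the initial conversion state by
   assigning a zero-valued mbstate_t instead of zeroing its bytes. */
void reset_state(mbstate_t *ps) {
    static const mbstate_t zero_state = {0}; /* zero-valued state object */
    assert(ps != NULL);
    *ps = zero_state;
}
```

After reset_state(&mbs), a call to mbsinit(&mbs) should return nonzero, which is a cheap sanity check before the first mbrlen call.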

The primary problem with your code revolves around the fact that it is analyzing a string literal, whose representation is determined at compile time based on the actual encoding of those characters in the source file, on their representation in the compiler's source character set, and on the execution character set chosen by the compiler. You cannot choose LC_CTYPE arbitrarily -- it has to be matched to the data for the mb conversion functions to work as intended.

C does not define a mechanism for a program to identify a locale whose LC_CTYPE corresponds to the execution character set, nor does it even require such a locale to exist. Your compiler's documentation should describe the mapping between source characters and execution characters, however, possibly in terms of a locale or well-known encoding, and it may even describe a way for you to specify that. Your compiler's documentation may also describe a way for you to specify the encoding it should assume for source files.

Furthermore, you have an additional potential issue with Unicode: there can be a mismatch between what you, a human, consider a "character" and the Unicode characters with which it is represented. Generally, this involves characters bearing diacritical marks such as accents. Many of the more commonly used of these have a single-character "composed" representation, but can also be represented as a sequence of a base character plus one or more combining characters.

mbrlen() is unlikely to distinguish between base and combining characters, so even without any encoding confusion, your observed result could arise from the characters being represented in decomposed form in the source files, or being transformed into that form by the compiler.
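To make the composed versus decomposed distinction concrete, here is a small sketch that counts UTF-8 code points without relying on any locale. The function and variable names are hypothetical, the input is assumed to be valid UTF-8, and the bytes are written as hex escapes so that the source file's encoding does not matter:

```c
#include <assert.h>
#include <stddef.h>

/* Count UTF-8 code points by counting every byte that is not a
   continuation byte (10xxxxxx). Assumes valid, NUL-terminated UTF-8. */
size_t utf8_codepoints(const char *s) {
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* "ö" in composed form: U+00F6, encoded as C3 B6. */
const char composed_o[] = "\xC3\xB6";

/* "ö" in decomposed form: 'o' followed by U+0308 (combining diaeresis),
   encoded as 6F CC 88. */
const char decomposed_o[] = "o\xCC\x88";
```

Here utf8_codepoints(composed_o) is 1 while utf8_codepoints(decomposed_o) is 2, even though both strings display as the same single letter. A per-code-point counter such as one built on mbrlen() would be expected to show the same difference.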

The bottom line is that your program depends on environmental and implementation characteristics that the standard does not specify, so it may behave differently with different implementations, as indeed seems to be the observation. Your particular observation could arise, for example, from the source file being encoded in UTF-8, the compiler assuming it to be encoded in a single-byte encoding such as ISO-8859-1 instead, yet the compiler using UTF-8 for its execution character set.
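The arithmetic behind that scenario can be simulated directly. The sketch below is only an illustration of the hypothesis, not a reproduction of what any particular compiler does. It takes the ten UTF-8 bytes of the string, treats each byte as one ISO-8859-1 character, and re-encodes the result as UTF-8. Both helper names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Re-encode `in` as UTF-8, treating every input byte as one ISO-8859-1
   character. Writes at most outsize-1 bytes plus a terminating NUL and
   returns the number of bytes written (excluding the NUL). */
size_t latin1_to_utf8(const char *in, char *out, size_t outsize) {
    size_t n = 0;
    assert(in != NULL && out != NULL && outsize > 0);
    for (; *in != '\0'; in++) {
        unsigned char b = (unsigned char)*in;
        if (b < 0x80) {
            if (n + 1 >= outsize) break;          /* keep room for NUL */
            out[n++] = (char)b;
        } else {
            if (n + 2 >= outsize) break;          /* keep room for NUL */
            out[n++] = (char)(0xC0 | (b >> 6));   /* lead byte */
            out[n++] = (char)(0x80 | (b & 0x3F)); /* continuation byte */
        }
    }
    out[n] = '\0';
    return n;
}

/* Count code points in valid UTF-8 by skipping continuation bytes. */
size_t count_utf8_chars(const char *s) {
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

Feeding it the UTF-8 bytes of "öçşğü" ("\xC3\xB6\xC3\xA7\xC5\x9F\xC4\x9F\xC3\xBC", 10 bytes, 5 characters) yields 20 output bytes that decode as 10 characters, which matches the count reported in the question.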

Your approach might work without changes if you ensure that the compiler interprets the source file according to that file's actual encoding, and that it uses UTF-8 as its execution character set. Alternatively, in C11 or later you can ensure that the runtime encoding of that specific string is UTF-8 by using a UTF-8 literal, like so:

char * text = u8"öçşğü";

That takes care of only the execution-side encoding, however. You still need to match the source file encoding to the actual encoding expected by the compiler, and you can still be affected by differences between pre-composed and decomposed characters.
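For experimenting, one way to take both the source encoding and the locale name out of the picture is to spell the string's bytes with hex escapes and to check whether setlocale actually succeeded. The following is a sketch under assumptions: the helper names are hypothetical, and the locale names tried are common spellings that may or may not exist on a given system.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count multibyte characters in s under the current LC_CTYPE locale.
   Returns (size_t)-1 on an invalid or incomplete sequence. */
size_t mb_count(const char *s) {
    mbstate_t mbs = {0};            /* zero-valued initial state */
    size_t remaining, len, chars = 0;
    assert(s != NULL);
    remaining = strlen(s);
    while (remaining > 0) {
        len = mbrlen(s, remaining, &mbs);
        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;      /* invalid or truncated input */
        if (len == 0)
            break;                  /* defensive: embedded NUL */
        s += len;
        remaining -= len;
        chars++;
    }
    return chars;
}

/* Try a few common names for a UTF-8 locale (platform-specific
   assumptions). Returns 1 if one of them was accepted, else 0. */
int use_utf8_locale(void) {
    static const char *const names[] = {
        "C.UTF-8", "en_US.UTF-8", "en_US.utf8"
    };
    size_t i;
    for (i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}
```

If use_utf8_locale() returns 1, mb_count("\xC3\xB6\xC3\xA7\xC5\x9F\xC4\x9F\xC3\xBC") should be 5. If it returns 0, the program is still in the "C" locale, where the bytes may not be interpreted as UTF-8 at all, so it is worth checking setlocale's return value before trusting any count.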
