
Does encoding affect the result of strstr() (and related functions)

Does the character set encoding affect the result of the strstr() function?

For example, I have read some data into "buf" and do this:

char *p = strstr (buf, "UNB");

Does it affect the result of this function whether the data is encoded in ASCII or something else (e.g. EBCDIC)? (After all, "UNB" is a different byte sequence under different encodings...)

If yes, what is the default encoding used by these functions? (ASCII?)

Thanks!

Your string constant ("UNB") is encoded using the source file's encoding, so it must match the encoding of the buffer.

The C functions like strstr operate on the raw char data, independently of the encoding. In this case, you potentially have two different encodings: the one the compiler used for the string literal, and the one your program used when filling buf. If these aren't the same, then the function may not work as expected.
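To make the mismatch concrete, here is a minimal sketch in C. It assumes the compiler uses an ASCII-derived execution character set, and that the buffer holds EBCDIC (code page 037) bytes; the names contains, ebcdic_unb and ebcdic_buf are made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: "UNB" as encoded in EBCDIC code page 037
   (U = 0xE4, N = 0xD5, B = 0xC2). */
static const char ebcdic_unb[] = { (char)0xE4, (char)0xD5, (char)0xC2, '\0' };

/* Hypothetical buffer: the EBCDIC bytes for "UNB+" ('+' = 0x4E),
   as they might arrive from a file or socket. */
static const char ebcdic_buf[] = {
    (char)0xE4, (char)0xD5, (char)0xC2, (char)0x4E, '\0'
};

/* Returns 1 if needle occurs in haystack. strstr compares raw bytes
   and knows nothing about which encoding produced them. */
int contains(const char *haystack, const char *needle)
{
    return strstr(haystack, needle) != NULL;
}
```

On an ASCII-based compiler, contains(ebcdic_buf, "UNB") fails because the literal is the bytes 0x55 0x4E 0x42, while contains(ebcdic_buf, ebcdic_unb) succeeds because both sides use the same byte sequence.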

With regard to the "default" encoding, there isn't one, at least as far as the standard is concerned; the "basic execution character set" is implementation-defined. In practice, systems which don't use an encoding derived from ASCII (ISO 8859-1 seems the most common, at least here in Europe) are exceedingly rare. As for the encoding you get in buf, that depends on where the characters come from; if you're reading from an istream, it depends on the locale imbued in the stream. In practice, however, almost all of these (UTF-8, ISO 8859-x, etc.) are derived from ASCII as well, and are identical with ASCII for all of the characters in the basic execution character set (which includes all of the characters legal in traditional C). So for "UNB", you're likely safe. (But for something like "üéâ", you almost certainly aren't.)

Both string parameters must use the same encoding. A string literal gets the encoding the compiler applies to the C++ source (typically the platform encoding). With Unicode (UTF-8, for instance) the function has another problem: an accented letter can be encoded as a single precomposed code point, or as a base letter followed by a combining diacritic. é can be one code point [é] or two: [e] + [combining ´]. Normalisation exists to reconcile the two forms.
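The normalisation problem can be sketched in C as follows. The bytes are written as hex escapes so the example does not depend on the source file encoding; the names utf8_contains, e_acute_nfc, e_acute_nfd and text_nfc are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Precomposed é: U+00E9, which is the two bytes C3 A9 in UTF-8. */
static const char e_acute_nfc[] = "\xC3\xA9";

/* Decomposed é: 'e' followed by U+0301 (combining acute accent),
   which is the three bytes 65 CC 81 in UTF-8. */
static const char e_acute_nfd[] = "e" "\xCC\x81";

/* Sample text "café au lait" stored in precomposed form. */
static const char text_nfc[] = "caf" "\xC3\xA9" " au lait";

/* Byte-wise substring test; no Unicode normalisation is applied. */
int utf8_contains(const char *haystack, const char *needle)
{
    return strstr(haystack, needle) != NULL;
}
```

Here utf8_contains(text_nfc, e_acute_nfc) finds the letter, but utf8_contains(text_nfc, e_acute_nfd) does not, although both needles display as the same character. Normalising both strings to the same form before searching avoids this; the C standard library offers no function for that, so a Unicode library would be needed.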

For Java it is quietly becoming common practice to explicitly set the source encoding to UTF-8. For C++ projects I am not aware of such conventions becoming widespread.

For UTF-8 encoded Unicode characters, strstr should work fine.

With this function, the data is encoded in ASCII.
