
passing unsigned char array to string functions

Say I have a UTF-8 encoded string in which words are delimited by ";". Every byte in this string (except ";") has a value > 128. Say I store this string in an unsigned char array:

unsigned char buff[]="someutf8string;separated;with;";

Is it safe to pass this buff to the strtok function, if I just want to extract the words using the ";" symbol?

My concern is that strtok (and also strcpy) expects char pointers, but some bytes in my string will have values > 128. So is this behaviour defined?

No, it is not safe -- but if it compiles it will almost certainly work as expected.

unsigned char buff[]="someutf8string;separated;with;";

This is fine; the standard specifically permits arrays of character type (including unsigned char) to be initialized with a string literal. Successive bytes of the string literal initialize the elements of the array.

strtok(buff, ";")

This is a constraint violation, requiring a compile-time diagnostic. (That's about as close as the C standard gets to saying that something is illegal.)

The first parameter of strtok is of type char*, but you're passing an argument of type unsigned char*. These two pointer types are not compatible, and there is no implicit conversion between them. A conforming compiler may reject your program if it contains such a call (and, for example, gcc -std=c99 -pedantic-errors does reject it).

Many C compilers are somewhat lax about strict enforcement of the standard's requirements. In many cases, compilers merely issue warnings for code containing constraint violations -- which is a perfectly valid form of diagnostic. But once a compiler has diagnosed a constraint violation and proceeded to generate an executable, the behavior of that executable is not defined by the C standard.

As far as I know, any actual compiler that doesn't reject this call will generate code that behaves just as you expect it to. The pointer types char* and unsigned char* almost certainly have the same representation and are passed the same way as arguments, and the types char and unsigned char are explicitly required to have the same representation for non-negative values. Even for values exceeding CHAR_MAX, like the ones you're using, a compiler would have to go out of its way to generate misbehaving code. You could have problems on a system that doesn't use two's complement for signed integers, but you're not likely to encounter such a system.

If you add an explicit cast:

strtok((char*)buff, ";")

this removes the constraint violation and will probably silence any warning -- but strictly speaking, the behavior is still undefined.

In practice, though, most compilers try to treat char , signed char , and unsigned char almost interchangeably, partly to cater to code like yours, and partly because they'd have to go out of their way to do anything else.

According to the C11 Standard (ISO/IEC 9899:2011 §7.24.1 String function conventions, ¶3, emphasis added):

For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value).

Note: essentially the same wording already appears in C99 (§7.21.1 ¶3).

So I do not see a problem.
