
Why doesn't strlen() count the byte of the terminating NUL-character, when the NUL-character is defined to be part of a string?

I know that strlen() does not count the terminating NUL character. I really do know that this is a fact. So this question is NOT asking why strlen() might "presumably" not return the right string length; that has already been asked and answered well here on Stack Overflow, for example in this thread or this one.

So let's go ahead to my question:

ISO/IEC 9899:1990 (E), 7.1.1, states:

A string is a contiguous sequence of characters terminated by and including the first null character.

Why does strlen() deviate from this standard definition and not "want" to accept a string together with its terminating NUL character?

Why?

Because you would expect this pseudocode's assertion to hold true:

str1 = "foo"
str2 = "bar"
str3 = concatenate(str1, str2)

Assert strlen(str1) + strlen(str2) == strlen(str3)

If the terminating '\0' were counted by strlen, the above assertion would not hold, which would be much more of a headache overall than the current C string behavior. More importantly, it would in my opinion be quite unintuitive and illogical.
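The pseudocode above can be sketched in C roughly as follows. The helper name concat_len is hypothetical, and the sketch assumes the caller supplies a destination buffer that is large enough for both strings plus one terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: joins a and b into dst and returns the length
   of the result. Assumes dst can hold strlen(a) + strlen(b) + 1 bytes. */
size_t concat_len(char *dst, const char *a, const char *b)
{
    assert(dst != NULL && a != NULL && b != NULL);
    strcpy(dst, a);     /* copies a, including its '\0' */
    strcat(dst, b);     /* overwrites that '\0' with b, then terminates again */
    return strlen(dst); /* one terminator remains, and it is not counted */
}
```

Calling concat_len(buf, "foo", "bar") with a 16-byte buf returns 6, which equals strlen("foo") + strlen("bar"). If strlen counted the terminator, the two inputs would measure 4 each while the result would measure 7, because concatenation keeps only one terminator.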

Not really an answer to your question, but consider this example:

char string[] = "string";
printf("sizeof: %zu\n", sizeof(string));
printf("strlen: %zu\n", strlen(string));

This prints

sizeof: 7
strlen: 6

So sizeof counts the \0, but strlen doesn't.

Questions like this, which ask why a certain age-old decision was made one way and not another, are hard to answer. I can say that it's perfectly obvious to me, anyway, that strlen should count just the real, "interesting" characters that are in the string, and ignore the \0 at the end that merely terminates it. I'm used to accounting for the \0 separately. I imagine it would have been considerably more of a nuisance overall if strlen had been defined the other way. But I can't prove this with convincing arguments, and I've been using strlen with its current definition for so long that I'm probably hopelessly biased; I might be saying "it's perfectly obvious to me that..." even if strlen's definition were quite wrong.

Taking your doubt as a reasonable point, we can say that a C string consists of two parts:

  1. the string's useful content ("the text");
  2. the null terminating character;

The terminating null character is purely a technical device that lets the C library functions find the end of the string. Still, if someone writes the declaration:

char * str = "some string";

they would logically expect its length to be 11, which is the number of characters they can see in the statement. Hence strlen() yields only the length of part 1 of the string.

There is a difference between the physical, stored representation of a C style string and the logical representation of a C style string.

The physical representation, that is, how the string is actually stored in memory or on other media, includes the null character. The null character is included when discussing the physical representation because it takes up an additional byte of storage. To be a C style string, the null character must be stored.

However, the logical representation of a string does not include the null character. The logical representation of a string includes only the text characters that the programmer wants to manipulate.

I suspect that the null character, a value of binary zero, was chosen because the original ASCII character set defined the character value zero as the NUL character. As one of the low-valued teletype control codes, it seems to be the ASCII character least likely to appear in text. See ASCII Character Codes.

Another nice quality of using binary zero as the string terminator is that it is the value that represents logical false. Iterating over a string is therefore often a matter of incrementing an array index or a pointer while the current character is logically true, since every character other than the end-of-string indicator has a non-zero, or logical true, value.
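As a minimal sketch of that idiom, here is a hand-written length counter. The name count_chars is hypothetical; it is only a stand-in for illustration, and real strlen() implementations are usually more optimized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for strlen(), for illustration only. */
size_t count_chars(const char *s)
{
    assert(s != NULL);
    const char *p = s;
    while (*p)  /* '\0' has the value 0, i.e. false, so the loop stops there */
        p++;
    return (size_t)(p - s); /* the terminator itself is not counted */
}
```

The loop condition needs no explicit comparison, and the pointer difference naturally yields the number of characters before the terminator: count_chars("string") gives 6 and count_chars("") gives 0.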

Because the C programming language is so close to the hardware, the programmer needs to be concerned with both representations: the physical representation, which includes the null character, when allocating memory to store a string, and the logical representation, which is the string without the null character.
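A common place where both representations meet is duplicating a string on the heap. The following is a sketch under the assumption that the caller frees the result; the helper name copy_string is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of src, or NULL if the
   allocation fails. The caller is responsible for calling free(). */
char *copy_string(const char *src)
{
    assert(src != NULL);
    size_t len = strlen(src);    /* logical length: text only */
    char *dst = malloc(len + 1); /* physical size: text plus '\0' */
    if (dst == NULL)
        return NULL;
    memcpy(dst, src, len + 1);   /* copy the text and its terminator */
    return dst;
}
```

The "+ 1" is exactly the gap between the two representations: strlen() reports what the programmer sees, and the allocation must also cover the byte the programmer does not see.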

The various C style string manipulation functions in the Standard Library (strlen(), strcpy(), etc.) are all designed around the logical representation of a C style string. They treat the null character not as part of the text but as a special indicator that marks the end of the string. However, as part of their operations they need to be aware of the null character and its use as a special symbol. For instance, when strcpy() or strcat() is used to copy strings, it must also copy the null character that indicates the end of the string, even though that character is not part of the actual text of the logical representation.
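To make the copying behavior concrete, here is a hand-written copy loop in the spirit of strcpy(). The name copy_with_terminator and its return value are assumptions made for illustration; the real strcpy() returns the destination pointer and may be implemented quite differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration of a strcpy-style loop. Returns the number
   of bytes written to dst, including the terminator. Assumes dst is
   large enough and does not overlap src. */
size_t copy_with_terminator(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    /* The assignment happens before the test, so the '\0' is copied
       too, and only then does the loop stop. */
    while ((dst[i] = src[i]) != '\0')
        i++;
    return i + 1; /* logical length plus one byte for the terminator */
}
```

Copying "abc" this way writes 4 bytes: three text characters and the terminator, while any bytes beyond that in the destination are left untouched.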

This choice allows text strings to be stored as arrays of characters, as befits the hardware orientation and efficiency characteristics of C. There is no need to create an additional built in type for text strings and it fits well with the lean character of the C programming language.

C++ is able to provide std::string because it is object oriented and has additional language facilities that allow objects to be created and managed. The C programming language, with its simple syntax and lack of object-oriented facilities, does not have this convenience.

The problem with this approach is that the programmer needs to be aware of both the physical representation and the logical representation of text strings and be able to accommodate the needs of both when writing programs.

