C strings confusion

I'm learning C right now and got a bit confused with character arrays - strings.

char name[15]="Fortran";

No problem with this - it's an array that can hold (up to?) 15 chars.

char name[]="Fortran";

C counts the number of characters for me so I don't have to - neat!

char* name;

Okay. What now? All I know is that this can hold a big number of characters that are assigned later (e.g. via user input), but

  • Why do they call this a char pointer? I know of pointers as references to variables.
  • Is this an "excuse"? Does this find any other use than in char*?
  • What is this actually? Is it a pointer? How do you use it correctly?

Thanks in advance, lamas

I think this can be explained this way, since a picture is worth a thousand words...

We'll start off with char name[] = "Fortran", which is an array of chars whose length is known at compile time. It is 7, to be exact, right? Wrong! It is 8, since every string ends with a '\0' nul terminating character, and the array has to hold that too.

char name[] = "Fortran";
+======+     +-+-+-+-+-+-+-+--+
|0x1234|     |F|o|r|t|r|a|n|\0|
+======+     +-+-+-+-+-+-+-+--+

At link time, the compiler and linker gave the symbol name a memory address of 0x1234. Using the subscript operator, e.g. name[1], the compiler knows how to calculate where in memory the character at that offset is: 0x1234 + 1 = 0x1235, and it is indeed 'o'. That is simple enough. Furthermore, the ANSI C standard defines the size of a char as 1 byte, which explains how the runtime can obtain the value of an expression like name[cnt++]: assuming cnt is an integer with a value of 3, the character at that offset (counting from zero) is 't', and cnt then steps up by one automatically. So far so good.

What happens if name[12] was executed? Well, the behavior is undefined: the code may crash, or you may get garbage, since the array runs from index/offset 0 (0x1234) up to 7 (0x123B). Anything after that does not belong to the name variable; reaching past the end like this is what lies behind a buffer overflow!

The address of name in memory is 0x1234, as in the example. If you were to do this:

printf("The address of name is %p\n", (void *)&name);

Output would be:
The address of name is 0x00001234

For the sake of brevity and keeping with the example, the memory addresses are 32-bit, hence you see the extra 0's. Fair enough? Right, let's move on.

Now on to pointers... char *name is a pointer to char....

Edit: And we initialize it to NULL as shown. Thanks Dan for pointing out the little error...

char *name = (char*)NULL;
+======+     +======+ 
|0x5678| ->  |0x0000|    ->    NULL
+======+     +======+

At compile/link time, name does not point to anything, but the symbol name itself has a compile/link time address (0x5678). The value stored in the pointer is NULL, shown here as 0x0000, because the address it will eventually point to is not yet known.

Now, remember, this is crucial: the address of the symbol is known at compile/link time, but the address stored in the pointer is not known until runtime. This holds for pointers of any type.

Suppose we do this:

name = (char *)malloc((20 * sizeof(char)) + 1);
strcpy(name, "Fortran");

We called malloc to allocate a block of 21 bytes: room for 20 characters, plus the 1 I added to the size for the '\0' nul terminating character. Suppose that at runtime the address given was 0x9876:

char *name;
+======+     +======+          +-+-+-+-+-+-+-+--+
|0x5678| ->  |0x9876|    ->    |F|o|r|t|r|a|n|\0|
+======+     +======+          +-+-+-+-+-+-+-+--+

So when you do this:

printf("The address of name is %p\n", (void *)&name);
printf("The address name points to is %p\n", (void *)name);

Output would be:
The address of name is 0x00005678
The address name points to is 0x00009876

Now, this is where the illusion that 'arrays and pointers are the same' comes into play.

When we do this:

char ch = name[1];

What happens at runtime is this:

  1. The address of symbol name is looked up.
  2. Fetch the memory address of that symbol, i.e. 0x5678.
  3. That address holds another address, a pointer to memory; fetch it, i.e. 0x9876.
  4. Take the offset from the subscript value of 1 and add it to the pointer address, giving 0x9877; retrieve the value at that memory address, i.e. 'o', and assign it to ch.

The above is crucial to understanding this distinction: the difference between arrays and pointers is how the runtime fetches the data. With pointers, there is an extra indirection when fetching.

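Remember, in most expressions an array of type T decays into a pointer to its first element of type T. The exceptions are when the array is the operand of sizeof or unary &, or is a string literal used to initialize an array.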

When we do this:

char ch = *(name + 5);
  1. The address of symbol name is looked up.
  2. Fetch the memory address of that symbol, i.e. 0x5678.
  3. That address holds another address, a pointer to memory; fetch it, i.e. 0x9876.
  4. Take the offset value of 5 and add it to the pointer address, giving 0x987B; retrieve the value at that memory address, i.e. 'a', and assign it to ch.

Incidentally, you can do the same with the array of chars...

Furthermore, using the subscript operator on an array, i.e. char name[] = "..."; and name[subscript_value], is really the same as *(name + subscript_value), i.e.

name[3] is the same as *(name + 3)

And since addition is commutative, the operands in the expression *(name + subscript_value) can be swapped:

*(subscript_value + name) is the same as *(name + subscript_value)

Hence, this explains why, as one of the other answers shows, you can write it like this (the practice is not recommended, even though it is quite legitimate!):

3[name]

OK, how do I get the value the pointer points to? That is what the * is used for. Suppose the pointer name holds the memory address 0x9878; referring to the above example again, this is how it is done:

char ch = *name;

This means: obtain the value stored at memory address 0x9878. Now ch will have the value 'r'. This is called dereferencing. We just dereferenced the name pointer to obtain the value and assign it to ch.

Also, the compiler knows that sizeof(char) is 1, hence you can do pointer increment/decrement operations like this:

*name++;
*name--;

The pointer automatically steps up/down by one as a result.

When we do this, assuming the pointer holds the memory address 0x9878:

char ch = *name++;

What is the value of ch, and what is the address now? The answer: because ++ here is a post-increment, ch is assigned 'r', the value at the old address 0x9878. The pointer then moves on to 0x9879, so *name now yields 't'.

This is where you have to be careful too. In the same principle and spirit as what was stated earlier about memory boundaries (see the name[12] example above), stepping the pointer outside the allocated block gives the same result, i.e. the code crashes and burns!

Now, what happens if we deallocate the block of memory pointed to by name by calling the C function free with name as the parameter, i.e. free(name), and then set name = NULL? (free on its own leaves the stale address 0x9876 in name; resetting the pointer by hand gives the picture below.)

+======+     +======+ 
|0x5678| ->  |0x0000|    ->    NULL
+======+     +======+

Yes, the block of memory is freed up and handed back to the runtime environment for use by a later call to malloc.

Now, this is where the common notion of a segmentation fault comes into play. Since name does not point to anything, what happens when we dereference it, i.e.

char ch = *name;

Yes, the code will most likely crash and burn with a 'Segmentation fault'; this is common under Unix/Linux. Under Windows, a dialog box will appear along the lines of 'Unrecoverable error' or 'An error has occurred with the application, do you wish to send the report to Microsoft?'.... Dereferencing a pointer that is NULL, or that has never been pointed at valid memory (e.g. via malloc), is undefined behavior and in practice usually crashes and burns.

Also, remember this: for every malloc there must be a corresponding free. If there is no corresponding free, you have a memory leak, in which memory is allocated but never freed up.

And there you have it: that is how pointers work and how arrays differ from pointers. If you are reading a textbook that says they are the same, tear out that page and rip it up! :)

I hope this is of help to you in understanding pointers.

That is a pointer, which means it is a variable that holds an address in memory. It "points" to another variable.

It actually cannot, by itself, hold large amounts of characters. By itself, it can hold only one address in memory. If you initialize it with a string literal at creation, the compiler sets aside storage for those characters and makes the pointer point to that address. You can do it like this:

char* name = "Mr. Anderson";

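That looks much the same as the following, with one important difference: the pointer form points at a string literal, which must not be modified, whereas the array form creates a writable copy of the characters: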

char name[] = "Mr. Anderson";

The place where character pointers come in handy is dynamic memory. You can assign a string of any length to a char pointer at any time in the program by doing something like this:

char *name;
name = malloc(256*sizeof(char));
strcpy(name, "This is less than 256 characters, so this is fine.");

Alternatively, you can assign to it using the strdup() function (a POSIX function, not part of ISO C before C23), like this:

char *name;
name = strdup("This can be as long or short as I want.  The function will allocate enough space for the string and return a pointer to it.  Which then gets assigned to name");

If you use a character pointer this way and assign memory to it, you have to free the memory that name points to before reassigning it. Like this:

if(name)
    free(name);
name = 0;

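The if statement checks that name is not a null pointer before freeing its memory. Strictly speaking, free(NULL) is a harmless no-op, so the check is optional, and it cannot tell whether a non-null pointer is still valid. Setting name to 0 afterwards is what keeps the pointer from dangling.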

The reason you see character pointers used a whole lot in C is that they allow you to reassign the string with a string of a different size. Fixed-size character arrays don't do that. Pointers are also easier to pass around.

Also, character pointers are handy because they can be used to point to different statically allocated character arrays. Like this:

char *name;

char joe[] = "joe";
char bob[] = "bob";

name = joe;

printf("%s", name);

name = bob;
printf("%s", name);

This is what often happens when you pass a statically allocated array to a function taking a character pointer. For instance:

char *strcpy(char *str1, const char *str2);

If you then call it like this:

char buffer[256];
strcpy(buffer, "This is a string, less than 256 characters.");

It will work with both of those through str1 and str2, which are just pointers to where buffer and the string literal are stored in memory.

Something to keep in mind when working in a function: if you have a function that returns a character pointer, don't return a pointer to a local (automatic) character array declared in the function. The array ceases to exist when the function returns, and you'll have issues. Repeat, don't do this:

char *myFunc() {
    char myBuf[64];
    strcpy(myBuf, "hi");
    return myBuf;
}

That won't work. You have to use a pointer and allocate memory (as shown earlier) in that case. The memory allocated will then persist, even after you leave the function's scope. Just don't forget to free it, as previously mentioned.

This ended up a bit more encyclopedic than I'd intended; hope it's helpful.

Edited to remove C++ code. I mix the two so often, I sometimes forget.

char* name is just a pointer. Somewhere along the line, memory has to be allocated and the address of that memory stored in name.

  • It could point to a single byte of memory and be a "true" pointer to a single char.
  • It could point to a contiguous area of memory which holds a number of characters.
  • If those characters happen to end with a null terminator, lo and behold, you have a pointer to a string.

In C a string is actually just an array of characters, as you can see by the definition. However, superficially, any array is just a pointer to its first element; see below for the subtle intricacies. There is no range checking in C; the range you supply in the variable declaration only determines how much memory is allocated for the variable.

a[x] is the same as *(a + x), i.e. a dereference of the pointer a incremented by x.

If you used the following:

char foo[] = "foobar";
char bar = *foo;

bar will be set to 'f'.

To stave off confusion and avoid misleading people, some extra words on the more intricate differences between pointers and arrays, thanks avakar:

In some cases a pointer is actually semantically different from an array. A (non-exhaustive) list of examples:

//sizeof
sizeof(char*) != sizeof(char[10])

//lvalues
char foo[] = "foobar";
char bar[] = "baz";
char* p;
foo = bar; // compile error, an array is not a modifiable lvalue
p = bar; //just fine p now points to the array contents of bar

// multidimensional arrays
int baz[2][2];
int* q = baz; //incompatible pointer types: baz decays to int (*)[2], not to int*
int* r = baz[0]; //just fine, r now points to the first element of the first "row" of baz
int x = baz[1][1];
int y = r[1][1]; //compile error, r[1] is a plain int, so it cannot be subscripted again
int z = r[1]; //just fine, z now holds the second element of the first "row" of baz

And finally a fun bit of trivia: since a[x] is equivalent to *(a + x), you can actually use e.g. '3[a]' to access the fourth element of array a. I.e. the following is perfectly legal code, and will print 'b', the fourth character of string foo.

#include <stdio.h>

int main(int argc, char** argv) {
  char foo[] = "foobar";

  printf("%c\n", 3[foo]);

  return 0;
}

char *name, on its own, can't hold any characters. This is important.

char *name just declares that name is a pointer (that is, a variable whose value is an address) that will be used to store the address of one or more characters at some point later in the program. It does not, however, allocate any space in memory to actually hold those characters, nor does it guarantee that name even contains a valid address. In the same way, if you have a declaration like int number, there is no way to know what the value of number is until you explicitly set it.

Just as, after declaring an integer, you might later set its value (number = 42), after declaring a pointer to char you might later set its value to a valid memory address that contains a character, or sequence of characters, that you are interested in.

It is confusing indeed. The important thing to understand and distinguish is that char name[] declares an array and char* name declares a pointer. The two are different animals.

However, an array in C can be implicitly converted to a pointer to its first element. This gives you the ability to perform pointer arithmetic and iterate through array elements (it does not matter what type the elements are, char or not). As @which mentioned, you can use either the indexing operator or pointer arithmetic to access array elements. In fact, the indexing operator is just syntactic sugar (another representation of the same expression) for pointer arithmetic.

It is important to distinguish between an array and a pointer to the first element of an array. It is possible to query the size of an array declared as char name[15] using the sizeof operator:

char name[15] = { 0 };
size_t s = sizeof(name);
assert(s == 15);

but if you apply sizeof to char* name you will get the size of a pointer on your platform (e.g. 4 bytes on a typical 32-bit system):

char* name = 0;
size_t s = sizeof(name);
assert(s == 4); // assuming pointer is 4-bytes long on your compiler/machine

Also, the following two forms of defining an array of char elements are equivalent:

char letters1[5] = { 'a', 'b', 'c', 'd', '\0' };
char letters2[5] = "abcd"; /* 5th element implicitly gets value of 0 */

Thanks to the dual nature of arrays, the implicit conversion of an array to a pointer to its first element, in C (and also C++) a pointer can be used as an iterator to walk through array elements:

/* skip to 'd' letter */
char* it = letters1;
for (int i = 0; i < 3; i++)
    it++;

One is an actual array object and the other is a reference or pointer to such an array object.

The thing that can be confusing is that both have the address of the first character in them, but only because one address is the first character and the other address is a word in memory that contains the address of the character.

The difference can be seen in the value of &name. In the first two cases it is the same value as just name, but in the third case it is the address of the pointer itself, and its type is pointer to pointer to char, or char **. That is, it is a double-indirect pointer.

#include <stdio.h>

char name1[] = "fortran";
char *name2 = "fortran";

int main(void) {
    printf("%lx\n%lx %s\n", (long)name1, (long)&name1, name1);
    printf("%lx\n%lx %s\n", (long)name2, (long)&name2, name2);
    return 0;
}
Ross-Harveys-MacBook-Pro:so ross$ ./a.out
100001068
100001068 fortran
100000f58
100001070 fortran
