What are some of the drawbacks to using C-style strings?

I know that buffer overruns are one potential hazard of using C-style strings (char arrays). If I know my data will fit in my buffer, is it okay to use them anyway? Are there other drawbacks inherent to C-style strings that I need to be aware of?

EDIT: Here's an example close to what I'm working on:

char buffer[1024];
char * line = NULL;
while ((line = fgets(fp)) != NULL) { // this won't compile, but that's not the issue
    // parse one line of command output here.
}

This code takes data from a FILE pointer that was created using a popen("df") command. I'm trying to run Linux commands and parse their output to get information about the operating system. Is there anything wrong (or dangerous) with setting the buffer to some arbitrary size this way?

There are a few disadvantages to C strings:

  1. Getting the length is a relatively expensive operation, because the string has to be scanned up to its terminator.
  2. No embedded nul characters are allowed.
  3. The signedness of char is implementation defined.
  4. The character set is implementation defined.
  5. The width of the char type in bits (CHAR_BIT) is implementation defined, although sizeof(char) is always 1.
  6. You have to keep track separately of how each string was allocated, and therefore how it must be freed, or whether it needs to be freed at all.
  7. No way to refer to a slice of the string as another string.
  8. Strings are not immutable, meaning access to them must be synchronized separately.
  9. Strings cannot be manipulated at compile time.
  10. Switch cases cannot be strings.
  11. The C preprocessor does not recognize strings in expressions.
  12. Cannot pass strings as template arguments (C++).
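Items 1 and 7 are commonly worked around by carrying a pointer together with a length. The following is only a minimal sketch of that idea; the `str_view` type and the `sv_*` helpers are hypothetical names made up for illustration, not part of any standard library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type: a non-owning view of part of a string. */
typedef struct {
    const char *data;  /* not necessarily NUL-terminated at data[len] */
    size_t len;
} str_view;

/* Build a view over a whole C string; the O(n) strlen scan is paid once. */
static str_view sv_from_cstr(const char *s) {
    str_view v = { s, strlen(s) };
    return v;
}

/* Take the sub-range [start, start + count), clamped to the view.
   Nothing is copied, so the view is only valid while the source lives. */
static str_view sv_slice(str_view v, size_t start, size_t count) {
    if (start > v.len) start = v.len;
    if (count > v.len - start) count = v.len - start;
    str_view out = { v.data + start, count };
    return out;
}

/* Compare the bytes of a view against a NUL-terminated string. */
static int sv_equals_cstr(str_view v, const char *s) {
    size_t n = strlen(s);
    return v.len == n && memcmp(v.data, s, n) == 0;
}
```

A slice taken this way cannot be handed directly to functions that expect a NUL terminator, which is exactly the limitation item 7 describes.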

C strings lack the following aspects of their C++ counterparts:

  • Automatic memory management: you have to allocate and free their memory manually.
  • Extra capacity for concatenation efficiency: C++ strings often have a capacity greater than their size. This allows increasing the size without many reallocations.
  • Embedded NULs: by definition a NUL character ends a C string; C++ strings keep an internal size counter, so they don't need a special value to mark their end.
  • Sensible comparison and assignment operators: even though comparison of C string pointers is permitted, it's almost never what was intended. Similarly, assigning C string pointers (or passing them to functions) creates ownership ambiguities.
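The comparison point can be shown in a few lines. This is just an illustrative sketch; `same_pointer` and `same_text` are hypothetical helper names.

```c
#include <stdbool.h>
#include <string.h>

/* Compares addresses only: true only when both pointers refer to the
   same storage, regardless of what the strings contain. */
static bool same_pointer(const char *a, const char *b) {
    return a == b;
}

/* Compares contents byte by byte up to the NUL terminator. */
static bool same_text(const char *a, const char *b) {
    return strcmp(a, b) == 0;
}
```

Two separate arrays that both hold "df" are equal according to `same_text` but not according to `same_pointer`, which is the trap the `==` operator sets for C strings.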

In many applications, not being able to get the length in constant time is a serious overhead.

You may know that today 1024 bytes is enough to contain any input, but you don't know how things will change tomorrow or next year.

If premature optimization is the root of all evil, magic numbers are the stem.

Memory management and the like, for when you need to grow a string (char array), is rather tedious to reinvent.
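To give a sense of what "reinventing" involves, here is a minimal sketch of a growable buffer. The `strbuf` type and its functions are hypothetical names, the doubling policy is just one common choice, and size_t overflow on enormous inputs is deliberately not handled.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string: owns `data`, which is NUL-terminated
   whenever it is non-NULL. Initialize with {0}. */
typedef struct {
    char *data;
    size_t len;  /* bytes in use, excluding the terminator */
    size_t cap;  /* bytes allocated */
} strbuf;

/* Append `s`. Returns 0 on success, or -1 if allocation fails,
   in which case the buffer is left unchanged. */
static int strbuf_append(strbuf *b, const char *s) {
    size_t n = strlen(s);
    if (b->len + n + 1 > b->cap) {
        size_t new_cap = b->cap ? b->cap : 16;
        while (new_cap < b->len + n + 1)
            new_cap *= 2;  /* doubling keeps the number of reallocations low */
        char *p = realloc(b->data, new_cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = new_cap;
    }
    memcpy(b->data + b->len, s, n + 1);  /* copies the terminator too */
    b->len += n;
    return 0;
}

/* Release the storage and reset the buffer to its empty state. */
static void strbuf_free(strbuf *b) {
    free(b->data);
    b->data = NULL;
    b->len = 0;
    b->cap = 0;
}
```

Even this small version has to decide on a growth policy, an error convention, and who calls `strbuf_free`, which is the bookkeeping std::string does for you.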

There is no way to embed NUL characters (if you need them) in a C-style string.
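A tiny sketch of what that means in practice: the array can hold the bytes, but every function that treats it as a string stops at the first NUL. `visible_length` is a hypothetical wrapper used only to make the point.

```c
#include <stddef.h>
#include <string.h>

/* Length as the C string functions see it: the scan stops at the
   first NUL byte, no matter how large the underlying array is. */
static size_t visible_length(const char *s) {
    return strlen(s);
}
```

For an array initialized as `char data[] = "ab\0cd";` the array occupies 6 bytes, yet `visible_length(data)` reports 2, so the bytes after the embedded NUL are invisible to strlen, strcpy, printf("%s") and friends.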

Well, to comment on your specific example, you don't know that the data returned by your call to df will fit into your buffer. Never trust unsanitized input to your application, even when it supposedly comes from a known source like df.

For example, if a program named 'df' is placed somewhere in your search path so that it is executed instead of the system df, it could be used to exploit your buffer limit. The same applies if df itself is replaced by a malicious program.

When reading input from a file, use a function that lets you specify the maximum number of bytes to read. Under OSX and Linux (and in standard C generally) fgets() is declared as char *fgets(char *s, int size, FILE *stream); so it is safe to use on those systems, provided size does not exceed the size of the buffer.

Character encoding issues tend to surface when you have an array of bytes rather than a string of characters.

In your specific case, it's not the C string that's dangerous so much as reading an indeterminate amount of data into a fixed-size buffer. Don't ever use gets(char*), for example.

Looking at your example, though, it doesn't seem at all correct. Try this:

char buffer[1024];
char * line = NULL;
while ((line = fgets(buffer, sizeof(buffer), fp)) != NULL) {
    // parse one line of command output here.
}

This is a perfectly safe use of C strings, although you'll have to deal with the possibility that line does not contain an entire line, but was instead truncated to 1023 characters (plus a null terminator).
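One way to detect that truncation is to check whether the chunk returned by fgets ends in a newline. The sketch below is an assumption about how you might wrap it; `read_line` and its return codes are hypothetical, not anything from the original code.

```c
#include <stdio.h>
#include <string.h>

/* Read one chunk with fgets and strip the trailing newline, if any.
   Returns  1 if a complete line was read (or the final line at EOF
              that has no newline),
            0 if the line was longer than the buffer and was cut off,
           -1 on end of file or read error.
   Edge case: a final unterminated line that exactly fills the buffer
   is reported as 0, and the next call then returns -1. */
static int read_line(FILE *fp, char *buf, size_t size) {
    if (fgets(buf, (int)size, fp) == NULL)
        return -1;
    size_t n = strlen(buf);
    if (n > 0 && buf[n - 1] == '\n') {
        buf[n - 1] = '\0';
        return 1;
    }
    /* No newline: either the buffer filled up first, or the stream
       ended without a final newline. */
    return feof(fp) ? 1 : 0;
}
```

When the function returns 0, the caller can keep calling it to consume the rest of the long line, or treat the line as malformed, depending on what the parser needs.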

I think IT IS OKAY to use them; people have been using them for years. But I would rather use std::string if possible, because:

1) You don't have to be so cautious every time, and you can think about the problems of your domain instead of remembering to add another parameter each time, memory management and that kind of stuff. It is just safer to code at a higher level.

2) There are probably some other small concerns which are not a big deal but still matter, like the ones people have already mentioned: encoding, Unicode, all the related issues that the people who created std::string thought about. :)

Update

I worked on a project for half a year. Somehow I was stupid enough never to compile in release mode before delivery. :) Luckily there was just one error, which I found after 3 hours. It was a very simple string buffer overrun.

These days, the lack of Unicode support is reason enough...

C strings have opportunities for misuse, due to the fact that one has to scan the string to determine where it ends.

strlen - to find the length, it scans the string until it hits the NUL, or until it accesses protected memory.

strcat - has to scan to find the NUL in order to determine where to begin concatenating. There is no knowledge within a C string to tell whether there will be a buffer overrun or not.
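Because the string itself carries no capacity, the caller has to supply it. A minimal sketch of a checked append follows; `append_checked` is a hypothetical helper, and it still assumes that dst already holds a properly terminated string.

```c
#include <stddef.h>
#include <string.h>

/* Append src to the NUL-terminated string in dst, whose total
   capacity is dst_size bytes. Returns 0 on success, or -1 if the
   result would not fit, in which case dst is left unchanged. */
static int append_checked(char *dst, size_t dst_size, const char *src) {
    size_t used = strlen(dst);  /* same scan strcat would perform */
    size_t need = strlen(src);
    if (used >= dst_size || need > dst_size - used - 1)
        return -1;
    memcpy(dst + used, src, need + 1);  /* includes the terminator */
    return 0;
}
```

Unlike strcat, this refuses to write past the end, but only because the capacity is passed in separately; nothing in the string could have told it.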

C strings are risky, but generally faster than string objects.

IMHO, the hardest part of C strings is the memory management, because you need to be careful about whether you must pass a copy of a C string or whether you can pass a literal to a function, i.e. will the function free the passed string, or will it keep a reference for longer than the duration of the function call? The same applies to C string return values.

So it is not possible to share C string copies without considerable effort. In many cases this ends with unnecessary copies of the same C string in memory.
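Since the language cannot express ownership, the usual fallback is to state it in a comment and copy defensively. A minimal sketch, with `copy_string` as a hypothetical helper name:

```c
#include <stdlib.h>
#include <string.h>

/* Returns a heap-allocated copy of s that the CALLER owns and must
   release with free(). Returns NULL if allocation fails. The argument
   is only read, so passing a string literal is fine. */
static char *copy_string(const char *s) {
    size_t n = strlen(s) + 1;  /* include the terminator */
    char *p = malloc(n);
    if (p != NULL)
        memcpy(p, s, n);
    return p;
}
```

This makes the contract explicit, but it is also how the unnecessary copies mentioned above come about: every component that wants to keep the string ends up making its own.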

This question doesn't really have an answer.
If you're writing in C, what other options do you have?
If you're writing in C++, why are you asking? What is the reason not to use C++ primitives?
The only reason I can think of is linking C and C++ code and having char * somewhere in the interfaces. Sometimes it's just easier to use char * instead of converting back and forth all the time (especially if it's really 'good' C++ code that has 3 different C++ string object types).

C strings, like many other aspects of C, give you plenty of room to hang yourself. They are simple and fast, but unsafe in situations where assumptions such as the presence of the null terminator can be violated or where input can overrun the buffer. To use them reliably you have to observe fairly hygienic coding practices.

There used to be a saying that the canonical definition of a high-level language was "anything with better string handling than C".

Another consideration is who will be maintaining your code. What about in two years? Will that person be as comfortable with C-style strings as you are? As the STL gets more mature, it seems like people will be increasingly more comfortable with STL strings than with C-style strings.
