
Designing a string class in C++

Posted 2010-08-30 11:00:00. Tags: c++, string

I need to design (and code, at some point) a "customized" string class in C++. I was wondering if you could please let me know about any documentation and design issues, primarily, and potential pitfalls I should be aware of. Links are very welcome, as is the identification of problems (if any) with current string libraries (QString, std::string, and the others).

Thank you.

7 answers

Despite the critics, I think this is a valid question.

std::string is not a panacea. It looks like someone took the class from a pure-OO language and dumped it into C++, which is probably the case.

Advice 1: Prefer non-member non-friend functions
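To illustrate this advice, here is a minimal sketch. The class name `MyString`, its members, and the `starts_with` function are all hypothetical and only serve the example. The class keeps a small core interface, and the algorithm is a free function that uses only that public interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical minimal string class: only operations that need access
// to the representation are members.
class MyString {
public:
    MyString() = default;
    explicit MyString(std::string s) : data_(std::move(s)) {}

    std::size_t size() const { return data_.size(); }

    char operator[](std::size_t i) const {
        assert(i < data_.size());  // precondition: index in range
        return data_[i];
    }

private:
    std::string data_;  // stand-in for a real buffer (assumption of this sketch)
};

// Non-member, non-friend: uses only the public interface, so it keeps
// working unchanged if the internal representation is replaced.
bool starts_with(const MyString& s, const MyString& prefix) {
    if (prefix.size() > s.size()) return false;
    for (std::size_t i = 0; i < prefix.size(); ++i) {
        if (s[i] != prefix[i]) return false;
    }
    return true;
}
```

The point of the design is that every algorithm added this way leaves the class itself untouched, which is the opposite of the std::string approach of piling up member functions.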

Now that this is said, in this hour of internationalization, I would certainly advise you to design a class that supports Unicode. And I do say Unicode, not UTF-8 or UTF-16. It's ill-fitting (I think) to devise a class that would contain the data in a given encoding. You can then provide methods to output the information in various formats.

Advice 2: Support Unicode
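As a rough sketch of "store Unicode, output an encoding on request": the function below assumes the class keeps its text as a sequence of code points in a std::u32string (one possible internal representation, chosen here only for simplicity) and produces UTF-8 bytes when asked. The name `to_utf8` is hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical output method: encode Unicode code points as UTF-8 bytes.
std::string to_utf8(const std::u32string& code_points) {
    std::string out;
    for (char32_t cp : code_points) {
        // Surrogates and values above U+10FFFF are not valid scalar values.
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (cp < 0x80) {            // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

A UTF-16 output method would sit next to this one in the same way; the caller never needs to know how the class stores its data.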

Then, there are a number of points on memory allocation schemes:

Small String Optimization: the class contains pre-allocated space for a few characters (a dozen or two), and thus avoids heap allocation for short strings.
Copy On Write: the various strings share a buffer so that copying is cheap. When one string needs to modify its content, it copies the buffer if it is not the sole owner. The issue is that multithreading introduces overhead here, and it has been shown that for a general-purpose technique this overhead could dwarf the actual copying cost.
Immutability: "new" languages such as Java, C# or Python use immutable strings. Think of it as a pool of strings: all strings containing "Fooo" will point to the same buffer. Note that these languages support garbage collection, which rather helps here.

I would personally pick the "Small String Optimization" here (though it's not exclusive with the other two), simply because it's simple to implement and should actually benefit you (heap allocation cost, locality of reference issues).
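A minimal sketch of the Small String Optimization, under stated assumptions: the class name `SmallString`, the inline capacity of 15 characters, and the `is_inline` helper are all made up for the example. Real implementations usually overlay the inline buffer and the heap pointer in a union to save space; this sketch keeps them separate for readability.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical string with an inline buffer for short contents.
class SmallString {
public:
    explicit SmallString(const char* s) {
        assert(s != nullptr);
        size_ = std::strlen(s);
        char* dst = inline_;
        if (size_ > kInlineCapacity) {
            heap_ = new char[size_ + 1];  // only long strings touch the heap
            dst = heap_;
        }
        std::memcpy(dst, s, size_ + 1);   // copy including the terminator
    }

    // Copying re-runs the same small/large decision on the source text.
    SmallString(const SmallString& other) : SmallString(other.c_str()) {}

    // Assignment is left out to keep the sketch short.
    SmallString& operator=(const SmallString&) = delete;

    ~SmallString() { delete[] heap_; }

    const char* c_str() const { return heap_ ? heap_ : inline_; }
    std::size_t size() const { return size_; }
    bool is_inline() const { return heap_ == nullptr; }

private:
    static constexpr std::size_t kInlineCapacity = 15;  // assumed size
    std::size_t size_ = 0;
    char* heap_ = nullptr;
    char inline_[kInlineCapacity + 1];
};
```

Short strings live entirely inside the object, so creating and destroying them costs no allocation and their characters sit next to the rest of the object in memory.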

The other two techniques are somewhat complex in the face of multi-threading, and as such are likely error-prone and unlikely to yield any real benefit unless carefully crafted.
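For comparison, here is a deliberately single-threaded sketch of Copy On Write. The class `CowString` and its members are hypothetical, and wrapping a std::string in a std::shared_ptr is just a convenient way to get a shared buffer for the example. The `use_count()` check in `detach` is exactly the part that stops being reliable once other threads can copy the string concurrently, which is where the complexity mentioned above comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical copy-on-write string. NOT thread-safe.
class CowString {
public:
    explicit CowString(std::string s)
        : buf_(std::make_shared<std::string>(std::move(s))) {}

    // The compiler-generated copy operations copy the shared_ptr,
    // so a copy shares the buffer with its source.

    const std::string& str() const { return *buf_; }

    bool shares_buffer_with(const CowString& other) const {
        return buf_ == other.buf_;
    }

    // Every mutating operation must detach first.
    void set_char(std::size_t i, char c) {
        assert(i < buf_->size());
        detach();
        (*buf_)[i] = c;
    }

private:
    void detach() {
        // Only meaningful when no other thread copies this string concurrently.
        if (buf_.use_count() > 1) {
            buf_ = std::make_shared<std::string>(*buf_);
        }
    }

    std::shared_ptr<std::string> buf_;
};
```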

And that brings me to my last piece of advice:

Advice 3: Don't implement internal locking in an attempt at multithreading support

It will slow down the class when used in a single-threaded context and will not yield as much benefit as you'd think when used in a multithreaded one.

Finally, you could perhaps find something suiting your tastes (or get some pointers) by browsing existing code. I don't promise to exhibit "smooth" interfaces though:

ICU UnicodeString: Unicode support, at least
std::string: over 100 member methods (counting the various overloads)
llvm StringRef: note how many algorithms are implemented as member methods :'(

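Scott Meyers's Effective STL has some interesting discussion of possible std::string implementation techniques, although it covers fairly advanced issues such as copy-on-write and reference counting.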

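Depending on what "customized" means (for example, a custom allocator), you may be able to get it through the template parameters of the std::basic_string class.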
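A short sketch of that route, assuming a C++11 or later standard library. `CountingAllocator`, `allocation_count`, and `CountedString` are hypothetical names; the allocator only counts allocations, but the same slot could hold a pool or arena allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>

// Hypothetical global counter of heap allocations made through the allocator.
inline std::size_t& allocation_count() {
    static std::size_t n = 0;
    return n;
}

// Minimal C++11-style allocator that counts calls to allocate().
template <typename T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() = default;
    template <typename U>
    CountingAllocator(const CountingAllocator<U>&) {}

    T* allocate(std::size_t n) {
        ++allocation_count();
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { ::operator delete(p); }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) {
    return true;
}
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) {
    return false;
}

// Same interface as std::string, different allocation behaviour.
using CountedString =
    std::basic_string<char, std::char_traits<char>, CountingAllocator<char>>;
```

Note that `CountedString` is a different type from std::string, so the two do not convert to each other implicitly; that is the main practical cost of this approach.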

Herb Sutter gives a sample of a custom string class in GotW #29. You could use it as a starting point.

From a general-purpose point of view, a "new" string class would ideally combine the good points of std::string, CString, QString and others. A few points in random order:

MFC CString supports being used in printf-like functions due to a very specific implementation. If you need or want this feature I recommend buying the book "MFC Internals" by George Shepherd. Although the book is from 1996(!), its description of how CString is implemented should be worth it. http://www.amazon.com/MFC-Internals-Microsoft-Foundation-Architecture/dp/0201407213/ref=sr_1_1?ie=UTF8&s=books&qid=1283176951&sr=8-1
Check that your string class plays nicely with all interfaces you'll use it with (iostreams, Windows API, printf*, etc.)
Don't aim for full Unicode support (as in: collation, grapheme clusters, ...) as that will mean your class will never be done, but consider making it a wchar_t class with conversion options.
Consider making the ctor/function that creates your string objects from char* always take the specific encoding of the character array. (Can be helpful in mixed UTF-8 / other character set environments.)
Look at the full CString interface and at the full std::string interface and decide what you are going to need and what you can skip.
Look at QString to see what the other two miss.
Do not provide implicit conversion to either char* or wchar_t*.
Consider adding convenient conversion functions to/from numeric types.
Don't write a string class without a full set of detailed unit tests!
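Three of the points above (explicit encoding on construction, no implicit conversion to char*, numeric conversions) can be sketched together. Everything here is hypothetical: the class name `Text`, the `Encoding` enum with just two values, and the choice to store UTF-8 internally are assumptions made for the example, not a recommendation.

```cpp
#include <cassert>
#include <string>
#include <utility>

enum class Encoding { Utf8, Latin1 };  // hypothetical, deliberately tiny set

class Text {
public:
    // Named constructor: the caller must state the encoding of the bytes.
    static Text from_bytes(const std::string& bytes, Encoding enc) {
        if (enc == Encoding::Utf8) {
            return Text(bytes);  // taken as-is; this sketch does not validate
        }
        // Latin-1 byte values map directly to code points U+0000..U+00FF.
        std::string utf8;
        for (unsigned char c : bytes) {
            if (c < 0x80) {
                utf8 += static_cast<char>(c);
            } else {
                utf8 += static_cast<char>(0xC0 | (c >> 6));
                utf8 += static_cast<char>(0x80 | (c & 0x3F));
            }
        }
        return Text(std::move(utf8));
    }

    // Convenience conversion from a numeric type.
    static Text from_number(long value) { return Text(std::to_string(value)); }

    // Explicit accessors instead of an implicit operator const char*().
    const char* c_str() const { return utf8_.c_str(); }
    const std::string& utf8() const { return utf8_; }

private:
    explicit Text(std::string utf8) : utf8_(std::move(utf8)) {}
    std::string utf8_;  // assumed internal representation
};
```

Because the only public ways in are `from_bytes` and `from_number`, a raw char* can never be silently interpreted in the wrong character set, and because the only way out is a named accessor, a `Text` cannot be passed to a char* parameter by accident.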

The world doesn't need another string class. Is this homework? If not, use std::string.

The problem with std::string is... that you can't change it. Sometimes you need the basics of a std::string, but disagree with the implementation in your C++ library.

As an example, the thread-safe reference counting employed means lots of locking (or at least locked operations). Also, if most of your strings are short (because you know this will be the case), you might want a string class that is optimized for that use case.

So even if you like the std::string API, or at least have learned to live with it, there is room for 'competing implementations' that are more or less workalikes.

PowerDNS would love to have one, as we currently pass many DNS host names around, and a large majority of them would fit in, say, a 25-byte fixed buffer, which would relieve a lot of new/delete pressure.