std::string implementation in GCC and its memory overhead for short strings

I am currently working on an application for a low-memory platform that requires a std::set of many short strings (>100,000 strings of 4-16 characters each). I recently transitioned this set from std::string to const char * to save memory, and I was wondering whether I was really avoiding all that much overhead per string.

I tried using the following:

std::string sizeTest = "testString";
std::cout << sizeof(sizeTest) << " bytes";

But it just gave me an output of 4 bytes, indicating that the string contains a pointer. I'm well aware that strings store their data in a char * internally, but I thought the string class would have additional overhead.

Does the GCC implementation of std::string incur more overhead than sizeof(std::string) would indicate? More importantly, is it significant for a data set of this size?

Here are the sizes of relevant types on my platform (it is 32-bit and has 8 bits per byte):

char: 1 byte
void *: 4 bytes
char *: 4 bytes
std::string: 4 bytes

Well, at least with GCC 4.4.5, which is what I have handy on this machine, std::string is a typedef for std::basic_string<char>, and basic_string is defined in /usr/include/c++/4.4.5/bits/basic_string.h. There's a lot of indirection in that file, but what it comes down to is that nonempty std::strings store a pointer to one of these:

  struct _Rep_base
  {
    size_type       _M_length;
    size_type       _M_capacity;
    _Atomic_word    _M_refcount;
  };

This is followed in memory by the actual string data. So std::string is going to have at least three words of overhead for each string, plus any overhead for having a higher capacity than length (probably none, depending on how you construct your strings; you can check by calling the capacity() method).
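As a small illustration of the capacity() check mentioned above, here is a minimal sketch. The helper name slack_bytes is made up for this example; it only reports how many bytes the string has reserved beyond what it currently holds, and it says nothing about the header or the allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: bytes reserved by the string but not currently used.
// The result depends on the library version and on how the string was built.
std::size_t slack_bytes(const std::string& s) {
    assert(s.capacity() >= s.size());  // required of any conforming std::string
    return s.capacity() - s.size();
}
```

Calling slack_bytes(sizeTest) on the string from the question shows whether that particular construction over-allocated.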

There's also going to be overhead from your memory allocator for doing lots of small allocations. I don't know what GCC uses for C++, but assuming it's similar to the dlmalloc-style allocator used for C's malloc, that could be at least two words per allocation, plus some space to align the size to a multiple of at least 8 bytes.
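To put rough numbers on this, the arithmetic can be sketched as a small function. Everything here is an assumption rather than a measurement: a three-word header, one extra byte for a terminating NUL, two words of allocator bookkeeping, and rounding up to a multiple of 8 bytes. The function name and its parameters are invented for the example, and a real allocator may account for its bookkeeping differently.

```cpp
#include <cassert>
#include <cstddef>

// Rough per-string heap cost for the refcounted layout described above.
// All constants are assumptions taken from the discussion, not measurements.
std::size_t estimate_heap_bytes(std::size_t capacity,
                                std::size_t word = 4,
                                std::size_t align = 8) {
    assert(word != 0 && align != 0);
    const std::size_t header      = 3 * word;      // length, capacity, refcount
    const std::size_t payload     = capacity + 1;  // characters plus assumed NUL
    const std::size_t bookkeeping = 2 * word;      // assumed allocator overhead
    const std::size_t raw = header + payload + bookkeeping;
    return (raw + align - 1) / align * align;      // round up to the alignment
}
```

Under these assumptions, a 10-character string such as "testString" comes to 12 + 11 + 8 = 31 bytes, rounded up to 32, on top of the 4-byte pointer held in the std::string object itself. A bare const char * to the same text would need the 11 bytes of data plus whatever the allocator adds.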

I'm going to guess you are on a 32-bit platform with 8 bits per byte. I'm also going to guess that, at least in the GCC version you are using, std::string is a reference-counted implementation. The 4-byte sizeof you see is a pointer to a structure containing the reference count and the string data (and any allocator state, if applicable).

In this GCC design, the only "short" string is one with size == 0, in which case it can share a representation with every other empty string. Otherwise you get a refcounted COW (copy-on-write) string.

To investigate this yourself, code up an allocator that keeps track of how much memory it allocates and deallocates, and how many times. Use this allocator to investigate the implementation of the container you're interested in.
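A minimal sketch of such a tracking allocator might look like the following. The names AllocStats, stats, CountingAllocator, CountedString and CountedSet are all made up for this example. It assumes a C++11 or later standard library (which fills in the remaining allocator members through std::allocator_traits), and the global tally is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <new>
#include <set>
#include <string>

// Hypothetical global tally of what the containers request.
struct AllocStats {
    std::size_t bytes;
    std::size_t allocations;
    std::size_t deallocations;
};

inline AllocStats& stats() {
    static AllocStats s = {0, 0, 0};
    return s;
}

// Minimal allocator that records every request before forwarding to operator new.
template <class T>
struct CountingAllocator {
    typedef T value_type;
    template <class U> struct rebind { typedef CountingAllocator<U> other; };

    CountingAllocator() {}
    template <class U> CountingAllocator(const CountingAllocator<U>&) {}

    T* allocate(std::size_t n) {
        stats().bytes += n * sizeof(T);
        ++stats().allocations;
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) {
        assert(stats().deallocations < stats().allocations);  // sanity check
        ++stats().deallocations;
        ::operator delete(p);
    }
};

// All instances share the global tally, so any two allocators compare equal.
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

typedef std::basic_string<char, std::char_traits<char>,
                          CountingAllocator<char> > CountedString;
typedef std::set<CountedString, std::less<CountedString>,
                 CountingAllocator<CountedString> > CountedSet;
```

Fill a CountedSet with representative strings and then read stats().bytes and stats().allocations. Note that this counts only the bytes the containers ask for; the allocator's own per-block bookkeeping and alignment padding are not visible at this level.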

If it's guaranteed that you have ">100,000 strings of 4-16 characters each", then don't use std::string. Instead, write your own ShortString class. It's interesting that sizeof(std::string) == 4; how is that possible? What are sizeof(char) and sizeof(void *)?
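One way such a ShortString class could look is sketched below. This is only an illustration of the idea, not a tested design: it assumes a hard limit of 16 characters, stores them inline with no heap allocation, and treats longer input as a caller error.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical fixed-capacity string: up to 16 characters stored inline,
// with no heap allocation and no length/capacity/refcount header.
class ShortString {
public:
    enum { kMaxLen = 16 };

    explicit ShortString(const char* s) {
        const std::size_t n = std::strlen(s);
        assert(n <= kMaxLen);               // longer input is a caller error here
        std::memset(buf_, 0, sizeof buf_);  // zero padding, so buf_ is always NUL-terminated
        std::memcpy(buf_, s, n);
    }

    const char* c_str() const { return buf_; }
    std::size_t size() const { return std::strlen(buf_); }

    // Ordering and equality, so the type can be used as a std::set key.
    friend bool operator<(const ShortString& a, const ShortString& b) {
        return std::strcmp(a.buf_, b.buf_) < 0;
    }
    friend bool operator==(const ShortString& a, const ShortString& b) {
        return std::strcmp(a.buf_, b.buf_) == 0;
    }

private:
    char buf_[kMaxLen + 1];  // +1 for the terminating NUL
};
```

Each ShortString occupies 17 bytes regardless of its content, so a std::set<ShortString> pays that fixed size plus the set's own per-node overhead, but no separate allocation per string. Whether that beats const char * depends on how the pointed-to text is stored.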

I've performed some comparisons of std::string overhead. In general it is about 48 bytes! Take a look at the article on my blog: http://jovislab.com/blog/?p=76
