C++ does std::string::erase reallocate and…?

First question: does std::string::erase reallocate?

Second question: is there a faster way to erase certain words or phrases from a std::string? The length of the string is usually around 300K.

It is not defined whether string::erase triggers a reallocation. You can check by comparing string::capacity before and after calling the method to see what happens. Removing part of a string is always going to trigger a copy of all characters that come after the erased part, since the storage of a string is required to be contiguous.
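The capacity check suggested above could look roughly like this. It is only a sketch: EraseReport and probe_erase are made-up names for illustration, and the result tells you about your own standard library, not about what the standard guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper type (not part of any library): capacity of the
// string before and after a single erase() call.
struct EraseReport {
    std::size_t capacity_before;
    std::size_t capacity_after;
};

// Erases `count` characters starting at `pos` and records the capacity on
// both sides of the call. `pos` must not exceed s.size(), otherwise
// std::string::erase throws std::out_of_range.
EraseReport probe_erase(std::string& s, std::size_t pos, std::size_t count) {
    EraseReport report;
    report.capacity_before = s.capacity();
    s.erase(pos, count);
    report.capacity_after = s.capacity();
    return report;
}
```

If capacity_before and capacity_after differ on a 300K string, your implementation touched the allocation during erase; if they are equal, it most likely only shifted the tail.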

For operations on large strings you might want to consider using a rope or a std::list instead. This might turn out faster depending on what you do.

21.4.1/3

No erase() or pop_back() member function shall throw any exceptions.

Since no such restriction exists on the allocator, I think it is safe to say that no, std::string::erase does not, and cannot, reallocate.

You might want to have a look at rope. It is a heavy-duty string (get it?) designed for large strings, with much faster substring operations. Unfortunately, it isn't part of std, but rather a common addition (in SGI, STLPort and GNU's libstdc++).

See STL Rope - when and where to use

It's already been mentioned that it is implementation-dependent whether std::string::erase triggers a reallocation, so I wanted to focus on the string searching. The traditional approach to this problem would be to use the Aho-Corasick algorithm.

Alternatively, David Musser wrote a paper on searching for needles (substrings) in large haystacks (strings) using a hybrid of the Boyer-Moore and Knuth-Morris-Pratt algorithms. The paper is available here. Adapting this would probably be simpler than rolling an Aho-Corasick implementation.

Musser's approach exhibits much faster behavior than the naive search and replace. It should be possible to adapt the algorithm for your purposes by modifying the BM skip loop and KMP lookup table to account for all of the needles that you are looking to replace. Allocate an output buffer in advance and iteratively construct the output string by appending to it all non-matching segments of the haystack. This approach will get less effective as the number of needles grows and the BM/KMP lookups saturate.
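To illustrate just the output-buffer part of this, here is a rough sketch. Note that it does not implement Musser's hybrid search: it uses plain std::string::find as a stand-in, and remove_needles is a hypothetical name, so treat it as an outline of the "append the non-matching segments" idea rather than a tuned solution.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns a copy of `haystack` with every occurrence of
// each needle removed, scanning left to right. Empty needles are ignored.
std::string remove_needles(const std::string& haystack,
                           const std::vector<std::string>& needles) {
    const std::size_t npos = std::string::npos;

    std::string out;
    out.reserve(haystack.size());  // the output is never longer than the input

    // next[i] caches the position of the next occurrence of needles[i].
    std::vector<std::size_t> next(needles.size());
    for (std::size_t i = 0; i < needles.size(); ++i) {
        next[i] = needles[i].empty() ? npos : haystack.find(needles[i]);
    }

    std::size_t pos = 0;  // start of the segment not yet copied to `out`
    for (;;) {
        // Pick the earliest match at or after `pos`; on a tie, the longest.
        std::size_t best = npos;
        std::size_t best_len = 0;
        for (std::size_t i = 0; i < needles.size(); ++i) {
            if (next[i] != npos && next[i] < pos) {
                next[i] = haystack.find(needles[i], pos);  // refresh stale entry
            }
            if (next[i] == npos) continue;
            if (next[i] < best ||
                (next[i] == best && needles[i].size() > best_len)) {
                best = next[i];
                best_len = needles[i].size();
            }
        }
        if (best == npos) break;

        out.append(haystack, pos, best - pos);  // keep the non-matching segment
        pos = best + best_len;                  // skip over the needle
    }
    out.append(haystack, pos, npos);  // keep whatever follows the last match
    return out;
}
```

For example, remove_needles("the quick brown fox", {"quick ", "fox"}) yields "the brown ". Every kept character is appended exactly once, and the reserve() call up front means the output buffer should not need to grow while it is being filled.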

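From what I can see in the STL source, the condition for reallocating the string during std::string::erase is: if (__new_size > this->capacity() || _M_rep()->_M_is_shared()). Since erase never makes the string longer, I think this means that the string is not reallocated during an erase call, unless its representation is shared.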

  1. No, std::string::erase does not reallocate - because it does not need to, and because it is the C++ philosophy that you don't pay (reallocation time) for what you don't need.
  2. Depends on what you want to erase and what you mean by quickly (quick to type or quick to run).

The first thing to do is of course to find a fast algorithm for locating the words/phrases that you want to remove. Then, if there's only one chunk to erase, std::string::erase should be perfectly suited for your needs. However, if for example you have the string "000aa11111bbbbb2222222c3333333333" and want to erase all phrases containing letters, just finding and erasing them one after another will lead to multiple copies of the remainder of the string - the '1's will get copied once, the '2's will get copied twice and so on. So if there are many phrases to erase in the string, there is room to improve performance - just copy the chunks that should remain in the string individually and overwrite the chunks you want to erase (| marks the position up to which the string is already "correct"):

  • "000|aa11111bbbbb2222222c3333333333"
  • "00011111|11bbbbb2222222c3333333333"
  • "000111112222222|2222222c3333333333"
  • "0001111122222223333333333|33333333"
  • "0001111122222223333333333"

That way, every character you keep after the first erased phrase is copied exactly once.
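A minimal sketch of that single-pass compaction, assuming you have already found the ranges to erase. erase_ranges is a name I made up for the example, and it expects the ranges as sorted, non-overlapping [begin, end) index pairs that lie inside the string.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: erases the given [begin, end) index ranges from `s`
// in a single left-to-right pass. Assumes the ranges are sorted,
// non-overlapping and within the bounds of `s`.
void erase_ranges(std::string& s,
                  const std::vector<std::pair<std::size_t, std::size_t>>& ranges) {
    std::size_t write = 0;  // everything before `write` is already "correct"
    std::size_t read = 0;   // start of the next chunk to keep

    for (const auto& r : ranges) {
        // Move the kept chunk [read, r.first) down to `write`.
        // Copying towards lower indices is safe with std::copy.
        if (write != read) {
            std::copy(s.begin() + read, s.begin() + r.first, s.begin() + write);
        }
        write += r.first - read;
        read = r.second;  // skip the erased chunk
    }

    // Move the tail after the last erased chunk, then drop the leftovers.
    if (write != read) {
        std::copy(s.begin() + read, s.end(), s.begin() + write);
    }
    write += s.size() - read;
    s.resize(write);
}
```

With the example string, erasing the ranges {3,5}, {10,15} and {22,23} goes through the same intermediate states as the list above. If what you want to erase can be decided one character at a time, std::remove_if followed by a single erase does the same kind of compaction.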

I'm using VC6 from MS, and it DOES reallocate the buffer on a std::string::erase() call. I had to remove all erase() calls from my code, as I sometimes use big strings and I found some big slowdowns due to this. So pay attention to your compiler and avoid erase(). Personally, I use a reassignment, str = "";, as a workaround.

 