简体   繁体   中英

C++ does std::string::erase reallocate and…?

First question, does std::string::erase reallocate?

Second question, are there any faster method to quickly erase certain words or phrase from a std::string ? The length of the string is usually around 300K.

It is not defined if string::erase is going to trigger a reallocation. You can check by comparing string::capacity to after and before calling the method to see what happens. Removing parts of a string is always going to trigger a copy of all characters that come after the erased parts, since the storage of a string is required to be continuous.

For operations on large strings you might want to consider using a rope or a std::list instead. This might turn out faster depending on what you do.

21.4.1/3

No erase() or pop_back() member function shall throw any exceptions.

Since no such restriction exists on the allocator, I think that it is safe to say that no, std::string::erase does not, and can not, reallocate.

You might want to have a look at rope . It is a heavy-duty string (get it?) designed for large strings, with much faster substring operations. Unfortunately, it isn't part of the std , but rather a common addition (in SGI, STLPort and GNU's libstdc++).

See STL Rope - when and where to use

It's already been mentioned that it's implementation dependent whether std::string::erase triggers a reallocation. so I wanted to focus on the string searching. The traditional approach to this problem would be to use the Aho-Corasick algorithm .

Alternatively, David Musser wrote a paper on searching for needles (substrings) in large haystacks (strings) using a hybrid of the Boyer-Moore and Knuth-Morris-Pratt algorithms. The paper is available here . Adapting this would probably be simpler than rolling an Aho-Corasick implementation.

Musser's approach exhibits must faster behavior than the naive search and replace. It should be possible to adapt the algorithm for your purposes by modifying the BM skip loop and KNP lookup table to account for all of the needles that you are looking to replace. Allocate an output buffer in advance and iteratively construct the output string by appending to it all non-matching segments of the haystack. This approach will get less effective as the number of needles grows and the BM/KNP lookups saturate.

从我对STL的了解中,我可以看到在std::string::erase期间重新分配字符串的条件是: if (__new_size > this->capacity() || _M_rep()->_M_is_shared())我认为这意味着字符串在erase呼叫期间未重新分配。

  1. No, std::string::erase does not reallocate - because it does not need to and because it's C++ philosophy that you don't pay (reallocation time) for what you don't need.
  2. Depends on what you want to erase and what you mean with quickly (quickly to type or to perform).

First thing to do is of course find a fast algorithm to find the words/phrases that you want to remove. Then, if there's only one chunk to erase, std::string::erase should be perfectly suited for your needs. However, if for example you have the string "000aa11111bbbbb2222222c3333333333" and want to erase all phrases containing letters, just finding and erasing them one after another will lead to multiple copies of the remainder of the string - the '1's will get copied once, '2's will get copied twice and so on. So if there are many phrases to erase in the string, there will be a possibility to improve performance - just copy the chunks that should remain in the string individually and overwrite the chunks you want to erase: (| denotes an iterator until which the string is "correct"):

  • "000|aa11111bbbbb2222222c3333333333"
  • "00011111|11bbbbb2222222c3333333333"
  • "000111112222222|2222222c3333333333"
  • "0001111122222223333333333|33333333"
  • "0001111122222223333333333"

That way, you have to copy every character after the first erased phrase exactly once.

I'm using VC6 from MS and this last DO reallocate buffer on std::string::erase() call. I had to remove all erase() calls from my code as I'm sometimes using big strings and I found some big slow down due to this. So care about your compiler and avoid erase(). Personally, I use reaffectations str = ""; as a workaround.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM