
Faster way of splitting string in C++

I have a string of length 5 million that I need to break into substrings of a given length (5, 10, ...) and store the fragments in a vector. The way I do it seems to take ages. I'm looking for a much faster method.

Example code showing how I do it:

// Example program
#include <iostream>
#include <string>
#include <vector>

int main()
{
   std::vector<std::string> splits;
   std::string text = "ABCDBCDAACBDAADCADACBBCDACDADBCAACDBCADACD";

   for(std::size_t i = 0; i < text.length(); i += 5)
   {
     splits.push_back(text.substr (i, 5));
     std::cout << "splits: " << text.substr(i, 5) << std::endl;

   }

}

This will be a little bit faster.

#include <iostream>
#include <string>
#include <vector>

int main()
{
   std::vector<std::string> splits;
   std::string text = "ABCDBCDAACBDAADCADACBBCDACDADBCAACDBCADACD";

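   // Start timing
   const std::size_t chunk = 5;  // fragment length

   // Reserve room for ceil(length / chunk) fragments up front
   splits.reserve((text.length() + chunk - 1) / chunk);

   // End of the last full-length fragment
   const auto end = text.begin() + (text.length() / chunk) * chunk;
   auto it = text.begin();
   for (; it < end; it += chunk)
   {
     // Construct the fragment in place from an iterator range
     splits.emplace_back(it, it + chunk);
   }

   // Any shorter leftover fragment at the end
   if (it != text.end())
   {
       splits.emplace_back(it, text.end());
   }
   // End timing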

   for (const auto& str : splits)
   {
       std::cout << "splits: " << str << std::endl;
   }
}

Rather than creating a temporary string with substr and then copying (or, since C++11, moving) it into the vector, this constructs each string directly inside the vector. To keep things as simple as possible, the main loop only creates full-length strings, and any partial string at the end is handled separately.

It also removes the printing from the timing loop (if you really are doing that, don't! IO is slow).
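As a rough sketch of how the timed region could be isolated from the printing, the snippet below measures only the split with std::chrono::steady_clock. The function names split_fixed and time_split_ms are hypothetical, chosen for this illustration, and the numbers you get will depend on your compiler, flags and standard library.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split `text` into fragments of length `n`.
// The last fragment may be shorter.
std::vector<std::string> split_fixed(const std::string& text, std::size_t n)
{
    assert(n > 0);
    std::vector<std::string> splits;
    splits.reserve((text.length() + n - 1) / n);
    for (std::size_t i = 0; i < text.length(); i += n)
    {
        // Substring constructor: builds the fragment in place and
        // clamps the count at the end of `text`.
        splits.emplace_back(text, i, n);
    }
    return splits;
}

// Hypothetical helper: time only the split, in milliseconds.
// Nothing is printed inside the timed region.
double time_split_ms(const std::string& text, std::size_t n, std::size_t& count_out)
{
    const auto start = std::chrono::steady_clock::now();
    const std::vector<std::string> splits = split_fixed(text, n);
    const auto stop = std::chrono::steady_clock::now();
    count_out = splits.size();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Print the fragments, if you need to, only after the second clock reading has been taken.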

Finally, enough space is reserved in the vector before creating the strings (although I notice you say in the comments that you are already doing that).

Having said all that, an alternative representation where you don't use std::string, but just use an offset + length in text will be much faster still.
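A minimal sketch of the offset + length idea, assuming the original text outlives the fragments and is not modified while they are in use. The Fragment struct and the split_offsets and fragment_text functions are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical fragment type: refers into the original text, owns nothing.
struct Fragment
{
    std::size_t offset;
    std::size_t length;
};

// Record where each fragment lives instead of copying its characters.
std::vector<Fragment> split_offsets(const std::string& text, std::size_t n)
{
    assert(n > 0);
    std::vector<Fragment> splits;
    splits.reserve((text.length() + n - 1) / n);
    for (std::size_t i = 0; i < text.length(); i += n)
    {
        const std::size_t remaining = text.length() - i;
        splits.push_back(Fragment{i, remaining < n ? remaining : n});
    }
    return splits;
}

// Materialize a std::string only when one is actually needed.
std::string fragment_text(const std::string& text, const Fragment& f)
{
    return text.substr(f.offset, f.length);
}
```

No per-fragment heap allocation happens here; the only allocation is the vector itself. If C++17 is available, std::string_view packages the same pointer + length idea in a standard type.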

Given that you know you are only holding short strings, a separate class with a fixed-length array (15 bytes?) plus a length (1 byte) might be an intermediate step. libstdc++'s std::string had no short string optimization before the new ABI introduced in GCC 5, so on such a toolchain every fragment costs a heap allocation, and allocating millions of small chunks of memory won't be that fast.
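One way such a fixed-size class might look, as a sketch only: ShortFragment and split_short are hypothetical names, and the 15-byte capacity is simply the figure guessed at above. Fragments longer than the capacity are not supported.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical fixed-capacity fragment: 15 bytes of characters plus a
// 1-byte length, stored inline with no heap allocation.
struct ShortFragment
{
    static const std::size_t capacity = 15;
    char data[15];
    unsigned char length;

    ShortFragment(const char* src, std::size_t n)
        : length(static_cast<unsigned char>(n))
    {
        assert(n <= capacity);
        std::memcpy(data, src, n);  // bytes past `length` stay unused
    }

    // Convert back to std::string only when needed.
    std::string str() const { return std::string(data, length); }
};

// Split `text` into inline fragments of length `n` (n must be <= 15).
std::vector<ShortFragment> split_short(const std::string& text, std::size_t n)
{
    assert(n > 0 && n <= ShortFragment::capacity);
    std::vector<ShortFragment> splits;
    splits.reserve((text.length() + n - 1) / n);
    for (std::size_t i = 0; i < text.length(); i += n)
    {
        const std::size_t remaining = text.length() - i;
        splits.emplace_back(text.data() + i, remaining < n ? remaining : n);
    }
    return splits;
}
```

Unlike the offset + length form, these fragments own their characters, so they stay valid even if the original text is destroyed, at the cost of one 16-byte copy per fragment.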

Final thought: You have enabled optimization, haven't you? It will make a huge difference.


 