
Faster way of splitting string in C++

I have a string of length 5 million that I need to break into substrings of a desired length (5 or 10 or ...) and store the fragments in a vector. The way I do it seems to take ages. I'm looking for an ultra-fast method.

Example code showing how I do it:

// Example program
#include <iostream>
#include <string>
#include <vector>

int main()
{
   std::vector<std::string> splits;
   std::string text = "ABCDBCDAACBDAADCADACBBCDACDADBCAACDBCADACD";

   for(std::size_t i = 0; i < text.length() ; i+= 5)
   {
     splits.push_back(text.substr (i, 5));
     std::cout << "splits: " << text.substr(i, 5) << std::endl;

   }

}

This will be a little bit faster.

#include <iostream>
#include <string>
#include <vector>

int main()
{
   std::vector<std::string> splits;
   std::string text = "ABCDBCDAACBDAADCADACBBCDACDADBCAACDBCADACD";

   // Start timing
   splits.reserve( (text.length()+5-1)/5 );

   const auto end = text.begin() +(text.length()/5)*5;
   auto it = text.begin();
   for(; it < end; it += 5)
   {
     splits.emplace_back(it, it+5);
   }

   if (it != text.end())
   {
       splits.emplace_back(it,text.end());
   }
   //end timing

   for (const auto& str : splits)
   {
       std::cout << "splits: " << str << std::endl;
   }
}

Rather than creating a new string with substr and then copying (or, since C++11, moving) that string into the vector, this constructs the string directly in the vector. To keep this as simple as possible, the main loop only creates full-length strings, and any partial string at the end is handled separately.

It also removes the printing from the timing loop (if you really are doing that, don't! I/O is slow).
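To make the "time only the split" idea concrete, here is a minimal sketch using std::chrono::steady_clock. The helper names split_fixed and time_split_ms are hypothetical, not part of the original answer, and the split is a generalised version of the loop above with a configurable width.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` into pieces of `width` characters;
// the last piece may be shorter.
std::vector<std::string> split_fixed(const std::string& text, std::size_t width)
{
    std::vector<std::string> splits;
    if (width == 0) return splits;                       // nothing sensible to do
    splits.reserve((text.size() + width - 1) / width);   // one allocation for the vector
    for (std::size_t pos = 0; pos < text.size(); pos += width)
    {
        // Substring constructor: copies at most `width` characters starting at `pos`.
        splits.emplace_back(text, pos, width);
    }
    return splits;
}

// Hypothetical helper: times only the split (no printing inside the measured region).
// Returns the elapsed time in milliseconds and stores the fragments in `out`.
double time_split_ms(const std::string& text, std::size_t width,
                     std::vector<std::string>& out)
{
    const auto start = std::chrono::steady_clock::now();
    out = split_fixed(text, width);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_split_ms(text, 5, splits) and printing the fragments afterwards keeps the I/O cost out of the measurement.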

Finally, enough space is reserved in the vector before creating the strings (although I notice you say in the comments that you are already doing that).

Having said all that, an alternative representation where you don't use std::string at all, but just store an offset + length into text, will be much faster still.
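One way the offset + length idea might look is sketched below. The names Fragment, split_offsets and view_of are hypothetical, and the std::string_view accessor assumes a C++17 compiler. No characters are copied, so each view is only valid while the original text is alive and unmodified.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical fragment record: a position in the original text,
// not a copy of the characters.
struct Fragment
{
    std::size_t offset;
    std::size_t length;
};

// Records where each fragment of `width` characters starts and how long it is.
std::vector<Fragment> split_offsets(const std::string& text, std::size_t width)
{
    std::vector<Fragment> fragments;
    if (width == 0) return fragments;
    fragments.reserve((text.size() + width - 1) / width);
    for (std::size_t pos = 0; pos < text.size(); pos += width)
    {
        const std::size_t remaining = text.size() - pos;
        fragments.push_back({pos, remaining < width ? remaining : width});
    }
    return fragments;
}

// Non-owning view of one fragment; valid only while `text` is alive and unchanged.
std::string_view view_of(const std::string& text, const Fragment& f)
{
    return std::string_view(text).substr(f.offset, f.length);
}
```

With a fixed width the offsets could even be computed on demand (index * width), in which case no vector is needed at all. The stored form is shown because it also works for variable-length fragments.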

Given that you know you are only holding short strings, a separate class with a fixed-length array (15 bytes?) plus a length (1 byte) might be an intermediate step. Older libstdc++ releases (GCC's standard library, before the GCC 5 ABI change) don't have the short string optimization, so allocating 20 million chunks of memory won't be that fast.
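A rough sketch of such a fixed-size class follows. The names ShortFragment and split_short are hypothetical, and truncating input longer than 15 characters is an assumption of this sketch rather than anything the answer specifies. Because the characters live inside the object, the vector's single buffer holds everything and there is no heap allocation per fragment.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical fragment type: up to 15 characters stored inline plus a 1-byte length.
class ShortFragment
{
public:
    static constexpr std::size_t capacity = 15;

    // Copies at most `capacity` characters; longer input is truncated
    // (an assumption of this sketch).
    ShortFragment(const char* data, std::size_t count)
    {
        if (count > capacity) count = capacity;
        length_ = static_cast<unsigned char>(count);
        std::memcpy(chars_, data, count);
    }

    std::size_t size() const { return length_; }

    // Materialises a std::string only when one is actually needed.
    std::string str() const { return std::string(chars_, length_); }

private:
    char chars_[capacity] = {};
    unsigned char length_ = 0;
};

// Splits `text` into inline fragments; widths that do not fit are rejected
// by returning an empty vector.
std::vector<ShortFragment> split_short(const std::string& text, std::size_t width)
{
    std::vector<ShortFragment> fragments;
    if (width == 0 || width > ShortFragment::capacity) return fragments;
    fragments.reserve((text.size() + width - 1) / width);
    for (std::size_t pos = 0; pos < text.size(); pos += width)
    {
        const std::size_t remaining = text.size() - pos;
        fragments.emplace_back(text.data() + pos, remaining < width ? remaining : width);
    }
    return fragments;
}
```

Unlike the offset + length form, these fragments own their characters, so they stay valid after the original text is destroyed, at the cost of 16 bytes per fragment on typical implementations.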

Final thought: you have enabled optimization, haven't you? It will make a huge difference.
