How can I speed up parsing of large strings?

So I've made a program that reads in various config files. Some of these config files are small, some are fairly large (the largest one is 3,844 KB).

The file that is read in is stored in a string (called sample in the program below).

I then have the program extract information from the string based on various formatting rules. This works well; the only issue is that reading larger files is very slow.

I was wondering if there is anything I could do to speed up the parsing, or if there is an existing library that does what I need (extract a string up to a delimiter, and extract the string between two delimiters on the same nesting level). Any assistance would be great.

Here's my code and a sample of how it should work:

#include "stdafx.h"

#include <cstdio>   // printf
#include <cstdlib>  // system
#include <string>
#include <vector>

std::string ExtractStringUntilDelimiter(
   std::string& original_string,
   const std::string& delimiter,
   const int delimiters_to_skip = 1)
{
   std::string needle = "";

   if (original_string.find(delimiter) != std::string::npos)
   {
      int total_found = 0;

      auto occurance_index = static_cast<size_t>(-1);

      while (total_found != delimiters_to_skip)
      {
         occurance_index = original_string.find(delimiter);
         if (occurance_index != std::string::npos)
         {
            needle = original_string.substr(0, occurance_index);
            total_found++;
         }
         else
         {
            break;
         }
      }

      // Remove the found string from the original string...
      original_string.erase(0, occurance_index + 1);
   }
   else
   {
      needle = original_string;
      original_string.clear();
   }

   if (!needle.empty() && needle[0] == '\"')
   {
      needle = needle.substr(1);
   }
   if (!needle.empty() && needle[needle.length() - 1] == '\"')
   {
      needle.pop_back();
   }

   return needle;
}

void ExtractInitialDelimiter(
   std::string& original_string,
   const char delimiter)
{
   // Remove extra new line characters
   while (!original_string.empty() && original_string[0] == delimiter)
   {
      original_string.erase(0, 1);
   }
}

void ExtractInitialAndFinalDelimiters(
   std::string& original_string,
   const char delimiter)
{
   ExtractInitialDelimiter(original_string, delimiter);

   while (!original_string.empty() && original_string[original_string.size() - 1] == delimiter)
   {
      original_string.erase(original_string.size() - 1, 1);
   }
}

std::string ExtractStringBetweenDelimiters(
   std::string& original_string,
   const std::string& opening_delimiter,
   const std::string& closing_delimiter)
{
   const size_t first_delimiter = original_string.find(opening_delimiter);
   if (first_delimiter != std::string::npos)
   {
      int total_open = 1;
      const size_t opening_index = first_delimiter + opening_delimiter.size();

      for (size_t i = opening_index; i < original_string.size(); i++)
      {
         // Check if we have room for opening_delimiter...
         if (i + opening_delimiter.size() <= original_string.size())
         {
            for (size_t j = 0; j < opening_delimiter.size(); j++)
            {
               if (original_string[i + j] != opening_delimiter[j])
               {
                  break;
               }
               else if (j == opening_delimiter.size() - 1)
               {
                  total_open++;
               }
            }
         }


         // Check if we have room for closing_delimiter...
         if (i + closing_delimiter.size() <= original_string.size())
         {
            for (size_t j = 0; j < closing_delimiter.size(); j++)
            {
               if (original_string[i + j] != closing_delimiter[j])
               {
                  break;
               }
               else if (j == closing_delimiter.size() - 1)
               {
                  total_open--;
               }
            }
         }


         if (total_open == 0)
         {
            // Extract result, and return it...
            std::string needle = original_string.substr(opening_index, i - opening_index);
            // erase() takes a character count, not an end index.
            original_string.erase(first_delimiter, i + closing_delimiter.size() - first_delimiter);

            // Remove new line symbols
            ExtractInitialAndFinalDelimiters(needle, '\n');
            ExtractInitialAndFinalDelimiters(original_string, '\n');

            return needle;
         }
      }
   }

   return "";
}

int main()
{
   std::string sample = "{\n"
      "Line1\n"
      "Line2\n"
      "{\n"
         "SubLine1\n"
         "SubLine2\n"
      "}\n"
   "}";

   std::string result = ExtractStringBetweenDelimiters(sample, "{", "}");
   std::string LineOne = ExtractStringUntilDelimiter(result, "\n");
   std::string LineTwo = ExtractStringUntilDelimiter(result, "\n");

   std::string SerializedVector = ExtractStringBetweenDelimiters(result, "{", "}");
   std::string SubLineOne = ExtractStringUntilDelimiter(SerializedVector, "\n");
   std::string SubLineTwo = ExtractStringUntilDelimiter(SerializedVector, "\n");

   // Just for testing...
   printf("LineOne: %s\n", LineOne.c_str());
   printf("LineTwo: %s\n", LineTwo.c_str());
   printf("\tSubLineOne: %s\n", SubLineOne.c_str());
   printf("\tSubLineTwo: %s\n", SubLineTwo.c_str());
   system("pause");
}

Use string_view, or a hand-rolled equivalent.

Don't modify the loaded string.

  original_string.erase(0, occurance_index + 1);

is a code smell, and it is going to be expensive with a large original string.

If you are going to modify something, do it in one pass. Don't repeatedly delete from the front of it: every erase shifts all the remaining characters, so the loop as a whole is O(n^2). Instead, proceed along it and shove "finished" stuff into an output accumulator.

This will involve changing how your code works.
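As a rough illustration of the "proceed along it" idea, here is a minimal sketch that assumes C++17 and a single-character delimiter. The name take_until is made up for this example and is not from the question's code. Instead of erasing from the front of the string, it keeps a std::string_view cursor and shrinks the view, which is a constant-time operation:

```cpp
#include <string_view>

// Hypothetical helper (not part of the original code): returns the text up
// to the next delimiter and advances `rest` past that delimiter.
// Nothing is copied and the underlying buffer is never modified.
std::string_view take_until(std::string_view& rest, char delimiter)
{
    const auto pos = rest.find(delimiter);
    if (pos == std::string_view::npos)
    {
        // No delimiter left: hand back everything and leave the cursor empty.
        const std::string_view token = rest;
        rest = std::string_view{};
        return token;
    }
    const std::string_view token = rest.substr(0, pos);
    rest.remove_prefix(pos + 1); // O(1): only the view's start moves
    return token;
}
```

Calling it repeatedly on a view of the loaded file walks the text once from front to back. The returned views point into the original string, so that string has to outlive them.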

  1. You're reading your data into a string. "Length of string" should not be a problem. So far, so good...

  2. You're using "string.find()". That's not necessarily a bad choice.

  3. You're using "string.erase()". That's probably the main source of your problem.

SUGGESTIONS:

  • Treat the original string as "read-only". Don't call erase(), don't modify it.

  • Personally, I'd consider reading your text into a C string (a text buffer), then parsing the text buffer using strstr().
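To make the strstr() suggestion a bit more concrete, here is a small sketch under a few assumptions: the buffer is NUL-terminated, and the helper name split_c_buffer is invented for this example. It only moves a pointer through the buffer and copies out each piece, so the buffer itself stays read-only:

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: splits a NUL-terminated buffer on `delimiter`
// without modifying the buffer.
std::vector<std::string> split_c_buffer(const char* text, const char* delimiter)
{
    std::vector<std::string> tokens;
    const std::size_t delimiter_length = std::strlen(delimiter);
    if (delimiter_length == 0)
    {
        tokens.emplace_back(text); // nothing to split on
        return tokens;
    }

    const char* cursor = text;
    while (const char* hit = std::strstr(cursor, delimiter))
    {
        tokens.emplace_back(cursor, hit);  // copy the range [cursor, hit)
        cursor = hit + delimiter_length;   // continue after the delimiter
    }
    if (*cursor != '\0')
    {
        tokens.emplace_back(cursor);       // trailing text with no delimiter
    }
    return tokens;
}
```

Each character is examined a bounded number of times, so the work grows roughly linearly with the file size rather than quadratically.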

Here is a more efficient version of ExtractStringBetweenDelimiters. Note that this version does not mutate the original buffer. You would perform subsequent queries on the returned string.

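#include <algorithm>
#include <iterator>
#include <string>
#include <utility>

// Returns a copy of buffer with leading and trailing `what` characters removed.
std::string trim(std::string buffer, char what)
{
    auto not_what = [what](char ch)
    {
        return ch != what;
    };
    auto first = std::find_if(buffer.begin(), buffer.end(), not_what);
    // base() of the reverse iterator points one past the last kept character.
    auto last = std::find_if(buffer.rbegin(), std::make_reverse_iterator(first), not_what).base();
    return std::string(first, last);
}

// Returns the text between the first opening delimiter and the last closing
// delimiter, without modifying buffer. Returns "" if either one is missing.
std::string ExtractStringBetweenDelimiters(
    std::string const& buffer,
    const char opening_delimiter,
    const char closing_delimiter)
{
    std::string result;

    auto first = std::find(buffer.begin(), buffer.end(), opening_delimiter);
    if (first != buffer.end())
    {
        // Search backwards, stopping just after the opening delimiter.
        auto last = std::find(buffer.rbegin(), std::make_reverse_iterator(first + 1),
                              closing_delimiter).base();
        // base() points one past the closing delimiter, so step back by one
        // to keep the delimiter itself out of the result.
        if (last > first + 1)
        {
            result.assign(first + 1, last - 1);
            result = trim(std::move(result), '\n');
        }
    }
    return result;
}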

If you have access to string_view (C++17 for std::string_view, or boost::string_view) you could return one of these from both functions for extra efficiency.

It's worth mentioning that this method of parsing a structured file is going to cause you problems down the line if any of the serialised strings contains a delimiter, such as a '{'.

In the end you'll want to write your own parser or use someone else's.

The boost::spirit library is a little complicated to learn, but it creates very efficient parsers for this kind of thing.
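For reference, a string_view flavour of the two functions could look roughly like the sketch below. It assumes C++17, and the names trim_view and between_view are made up here. Like the version above, it pairs the first opening delimiter with the last closing one:

```cpp
#include <string_view>

// Hypothetical string_view counterpart of trim(): no copy is made.
std::string_view trim_view(std::string_view text, char what)
{
    const auto first = text.find_first_not_of(what);
    if (first == std::string_view::npos)
    {
        return {}; // empty, or nothing but `what` characters
    }
    const auto last = text.find_last_not_of(what);
    return text.substr(first, last - first + 1);
}

// Hypothetical string_view counterpart of ExtractStringBetweenDelimiters():
// returns a view of the text between the first `opening` and last `closing`.
std::string_view between_view(std::string_view buffer, char opening, char closing)
{
    const auto first = buffer.find(opening);
    if (first == std::string_view::npos)
    {
        return {};
    }
    const auto last = buffer.rfind(closing);
    if (last == std::string_view::npos || last <= first)
    {
        return {};
    }
    return trim_view(buffer.substr(first + 1, last - first - 1), '\n');
}
```

The returned views refer to the caller's buffer, so they are only valid while that buffer is alive and unmodified. That is the price of avoiding the copies.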
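Short of a full parser, the delimiter-inside-a-string problem can be reduced by making the brace matcher aware of quotes. The sketch below is only an illustration under some assumptions: values are wrapped in plain double quotes, there are no escaped quotes inside them, and the name find_matching_close is invented for this example:

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns the index of the closing delimiter that
// matches the opening delimiter at open_pos, ignoring any delimiters that
// appear inside double-quoted text. Returns npos if there is no match.
// Assumption: quotes are never escaped inside a quoted value.
std::size_t find_matching_close(std::string_view text,
                                std::size_t open_pos,
                                char opening = '{',
                                char closing = '}')
{
    bool in_quotes = false;
    int depth = 0;
    for (std::size_t i = open_pos; i < text.size(); ++i)
    {
        const char ch = text[i];
        if (ch == '"')
        {
            in_quotes = !in_quotes;   // entering or leaving a quoted value
        }
        else if (!in_quotes)
        {
            if (ch == opening)
            {
                ++depth;
            }
            else if (ch == closing && --depth == 0)
            {
                return i;             // depth is back to zero: matching close
            }
        }
    }
    return std::string_view::npos;
}
```

This is a single pass over the text, so it keeps the linear behaviour. A real config grammar with escapes or comments would still be better served by a proper parser.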
