C++: efficiently parsing a string of Python arrays

Question

I have a file created from a series of Python arrays. I load it from an ifstream. The file is text and contains only the arrays. It has the form:

[[1 22 333 ... 9
  2 2 2    ... 2]
 ...    
 [5 6 2 ... 222
  5 5 5 ... 240]]

[[2 3 444 ... 9]
 ...    
 [5 6 2 ... 222
  5 5 5 ... 240]]

[[ etc...

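Each row of each array starts with [ and ends with ], but a row may be split across several lines in the file (i.e. there can be carriage returns or newlines between the opening [ and the closing ]). The array as a whole also starts and ends with brackets [].

The numbers will always be integers. Within a given array, every row has the same number of entries (i.e. the same number of columns), but that number may differ between arrays. The number of rows in an array is unknown and may also differ between arrays. The total number of arrays in a file is likewise unknown before the file is opened.

The arrays can be stored in any format. For the sake of this example, let's put them in a vector of vectors, i.e.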

typedef vector<vector<int>> myArray;  //Index [row][col]
typedef vector<myArray> myArrays;

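I want to parse this efficiently (the file may be very large, and there will most likely be many files). My boss is very keen on using std::regex for this, and I am happy with that as long as it is efficient.

So my question is: how can I parse this efficiently with a regular expression? And is there a way to parse it more efficiently without a regular expression?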

Answer 1

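std::from_chars() is efficient because it parses part of a string in place and tells you exactly where parsing stopped, so you can carry on from that position immediately without extracting substrings. In addition, the notes in the documentation say: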

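Unlike other parsing functions in C++ and C libraries, std::from_chars is locale-independent, non-allocating, and non-throwing. Only a small subset of parsing policies used by other libraries (such as std::sscanf) is provided. This is intended to allow the fastest possible implementation that is useful in common high-throughput contexts such as text-based interchange (JSON or XML).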

Here is an attempt at parsing your data.

/**
  g++ -std=c++17 -o prog_cpp prog_cpp.cpp \
      -pedantic -Wall -Wextra -Wconversion -Wno-sign-conversion \
      -g -O0 -UNDEBUG -fsanitize=address,undefined
**/

#include <iostream>
#include <sstream>
#include <charconv>
#include <cctype>
#include <string>
#include <vector>
#include <stdexcept>
#include <system_error> // std::errc

using MyRow = std::vector<int>;
using MyArray = std::vector<MyRow>;

std::vector<MyArray>
parse_arrays(std::istream &input_stream)
{
  auto arrays=std::vector<MyArray>{};
  auto line=std::string{};
  for(auto depth=0, line_count=1;
      std::getline(input_stream, line);
      ++line_count)
  {
    for(const auto *first=data(line), *last=first+size(line);
        first!=last;)
    {
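      // try first to consume all well known characters
      // (cast to unsigned char: std::isspace() is undefined for negative values)
      for(auto c=*first;
          std::isspace(static_cast<unsigned char>(c))||(c=='[')||(c==']');
          c=*(++first))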
      {
        switch(c)
        {
          case '[': // opening a row or an array
          {
            switch(++depth)
            {
              case 1:
              {
                arrays.emplace_back(MyArray{});
                break;
              }
              case 2:
              {
                arrays.back().emplace_back(MyRow{});
                break;
              }
              default:
              {
                const auto pfx="line "+std::to_string(line_count);
                throw std::runtime_error{pfx+": too deep"};
              }
            }
            break;
          }
          case ']': // closing a row or an array
          {
            switch(--depth)
            {
              case 0:
              {
                // nothing more to be done
                break;
              }
              case 1:
              {
                const auto &a=arrays.back();
                const auto sz=size(a);
                if((sz>1)&&(size(a[sz-1])!=size(a[sz-2])))
                {
                  const auto pfx="line "+std::to_string(line_count);
                  throw std::runtime_error{pfx+": row length mismatch"};
                }
                break;
              }
              default:
              {
                const auto pfx="line "+std::to_string(line_count);
                throw std::runtime_error{pfx+": ] mismatch"};
              }
            }
            break;
          }
          default: // a separator
          {
            // nothing more to be done
          }
        }
      }
      // the other characters probably represent an integer
      auto value=int{};
      if(auto [p, ec]=std::from_chars(first, last, value); ec==std::errc())
      {
        if(depth!=2)
        {
          const auto pfx="line "+std::to_string(line_count);
          throw std::runtime_error{pfx+": depth mismatch"};
        }
        arrays.back().back().emplace_back(value);
        first=p;
      }
      else
      {
        if(p!=first)
        {
          const auto pfx="line "+std::to_string(line_count);
          throw std::runtime_error{pfx+": integer out of range"};
        }
        else if(first!=last)
        {
          const auto pfx="line "+std::to_string(line_count);
          throw std::runtime_error{pfx+": unexpected char <"+*first+'>'};
        }
      }
    }
  }
  return arrays;
}

int
main()
{
  auto input=std::istringstream{R"(
[[1 22 333  9
  2 2 2     2]
     
 [5 6 2  222
  5 5 5  240]]

[[2 3 444  9]
     
 [5 6 2  222]]
)"};
  const auto arrays=parse_arrays(input);
  for(const auto &a: arrays)
  {
    for(const auto &r: a)
    {
      for(const auto &c: r)
      {
        std::cout << c << ' ';
      }
      std::cout << '\n';
    }
    std::cout << "~~~~~~~~~~~~~~~~\n";
  }
  return 0;
}

/**
1 22 333 9 2 2 2 2 
5 6 2 222 5 5 5 240 
~~~~~~~~~~~~~~~~
2 3 444 9 
5 6 2 222 
~~~~~~~~~~~~~~~~
**/
