Unexpected behavior with C++ istream, getline, binary files, regular expressions, and strings

Question

I'm working with a file that combines text and binary formats (and sometimes just plain text). So I decided to open the file as binary and give it a try. However, I'm getting unexpected behavior when I later use regular expressions (the kind of issue that suggests memory corruption):

(Edited to include a minimal example.)

#include <string>
#include <iostream>
#include <fstream>
#include <regex>
#include <ios>
#include <stdexcept>

struct FortranFormat {
    std::string itemsPerRow;
    std::string type;
    std::string numberOfCharacters;
};

class XXXParserException: virtual public std::runtime_error {
    using runtime_error::runtime_error;
};


std::string parseSection(const std::string &line) {
    return  line.substr(16, std::string::npos );
}


FortranFormat parse(const std::string& expression) {
    const std::regex getItemsExpr("\\(([0-9]+)([A|a|I|i|F|f|E|e])([0-9]+)\\)");

    std::cout << "expression: " << expression << std::endl;

    std::smatch elements;
    if (std::regex_match(expression, elements, getItemsExpr)) {
        
        return {elements[1].str(),elements[2].str(),elements[3].str()};
    } else {
        throw XXXParserException("The expression " + expression + " is not a recognized Fortran Format.");
    }
}

int main() {

    std::ifstream fb;
    fb.open("example.txt", std::ios::binary); // remove the binary flag, and it works
    std::string line;
    getline(fb, line);
    std::cout << "line: " << line << std::endl;
    std::string formula = parseSection(line);
  
    auto format = parse(formula);
    
    std::cout << "format: " << format.type << std::endl;
}

The printed output shows the right information:

line: *VALUES        6(5E16.8)
expression: (5E16.8)

(Even the exception text is broken and only shows the last portion: " is not a recognized Fortran Format.")

So, more out of curiosity than anything: am I doing something fundamentally wrong that is breaking something internally? Is this something that could be attributed to the compiler (VS2015)?

Just FYI, I will try a "jump between formats" approach to solve the issue (save the current position, close and reopen as text or binary as needed, restore the position), but I want to understand what might be wrong with my current approach.

Answer 1

There are two things to consider:

In text mode, \n is translated to and from the native EOL sequence (\r\n on Windows). In binary mode no such translation happens, so \n is always just the line feed character and nothing else. getline reads as much text as possible up to the next \n, which for a file with Windows line endings leaves you with a \r at the end of the string.
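A minimal sketch of one way to cope with this, assuming the file uses \r\n line endings and is opened with std::ios::binary. The helper names getlinePortable and firstLineOf are hypothetical and not part of the original post; the std::istringstream stands in for the binary-mode file so that no translation takes place.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads one line and drops a trailing '\r', so a CRLF
// file read without newline translation yields the same text as text mode.
bool getlinePortable(std::istream& in, std::string& line) {
    if (!std::getline(in, line)) {
        return false;  // end of input or stream error
    }
    if (!line.empty() && line.back() == '\r') {
        line.pop_back();  // remove the carriage return left by "\r\n"
    }
    return true;
}

// Simulates a binary-mode read of raw file bytes without touching the disk.
std::string firstLineOf(const std::string& rawBytes) {
    std::istringstream in(rawBytes);
    std::string line;
    getlinePortable(in, line);
    return line;
}
```

The stray \r also plausibly explains the truncated exception text in the question: on a typical console a carriage return moves the cursor back to the start of the line, so whatever is printed after it overwrites what was printed before.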

Then, std::regex_match requires the whole string to match the regex. Your regex doesn't allow for extra whitespace at the end of the string, so it doesn't match. std::regex_search would return true on that input, because the substring without the last character matches the pattern.
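The difference between the two calls can be sketched as follows. The wrapper names matchesWhole and matchesSomewhere are hypothetical. The sample input in the usage note is the made-up string (5E16) rather than the question's (5E16.8), because the pattern as written only accepts an integer width and has no provision for a decimal part such as .8.

```cpp
#include <regex>
#include <string>

// Same pattern as in the question, written as a raw string literal.
const std::regex kFormat(R"eos(\(([0-9]+)([A|a|I|i|F|f|E|e])([0-9]+)\))eos");

// True only if the entire string matches the pattern.
bool matchesWhole(const std::string& s) {
    return std::regex_match(s, kFormat);
}

// True if any substring matches the pattern.
bool matchesSomewhere(const std::string& s) {
    return std::regex_search(s, kFormat);
}
```

With these helpers, matchesWhole("(5E16)") succeeds, matchesWhole("(5E16)\r") fails because of the trailing carriage return, and matchesSomewhere("(5E16)\r") succeeds again.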

Pro tip: raw string literals make regexes much easier to read, because you don't have to escape the backslashes (and the pattern can now be pasted straight into regex101 for debugging):

const std::regex getItemsExpr(R"eos(\(([0-9]+)([A|a|I|i|F|f|E|e])([0-9]+)\))eos");
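As a quick check, the escaped literal from the question and the raw literal above spell exactly the same pattern. The sketch below compares them; the names kEscaped, kRaw, samePattern, and acceptsPipeAsType are hypothetical. One side observation, offered as an assumption about intent: inside a character class | is a literal character rather than alternation, so the class also accepts a pipe as the type letter, and [AaIiFfEe] is probably what was meant.

```cpp
#include <regex>
#include <string>

// The escaped literal from the question and the raw literal from the answer.
const std::string kEscaped = "\\(([0-9]+)([A|a|I|i|F|f|E|e])([0-9]+)\\)";
const std::string kRaw = R"eos(\(([0-9]+)([A|a|I|i|F|f|E|e])([0-9]+)\))eos";

// Both literals produce the same sequence of characters.
bool samePattern() {
    return kEscaped == kRaw;
}

// Demonstrates that '|' inside [...] is matched literally.
bool acceptsPipeAsType() {
    return std::regex_match(std::string("(5|16)"), std::regex(kRaw));
}
```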

Solution 1
1 accepted 2020-07-20 16:24:43
