
Read a file line-by-line twice using stringstream

I need to read a file line-by-line twice. The file content is expected to fit into memory, so I would normally read the whole file into a buffer and work with that buffer afterwards.

However, since I would like to use std::getline, I need to work with a std::basic_istream. So I thought it would be a good idea to write:

std::ifstream file(filepath);
std::stringstream ss;
ss << file.rdbuf();

for (std::string line; std::getline(ss, line);)
{
}

However, I'm not sure what exactly is happening here. I guess ss << file.rdbuf(); does not read the file into any internal buffer of ss. Actual file access should occur only at std::getline(ss, line);.

So, with a second for-loop of the same form, I would end up reading the whole file once again. That's inefficient.

Am I correct and hence need to come up with another approach?

I guess ss << file.rdbuf(); does not read the file into any internal buffer of ss. Actual file access should occur only at std::getline(ss, line);.

This is incorrect. cppreference.com has this to say about that operator<< overload:

 basic_ostream& operator<<( std::basic_streambuf<CharT, Traits>* sb); (9) 

9) Behaves as an UnformattedOutputFunction. After constructing and checking the sentry object, checks if sb is a null pointer. If it is, executes setstate(badbit) and exits. Otherwise, extracts characters from the input sequence controlled by sb and inserts them into *this until one of the following conditions is met:

  • end-of-file occurs on the input sequence;
  • inserting in the output sequence fails (in which case the character to be inserted is not extracted);
  • an exception occurs (in which case the exception is caught).

If no characters were inserted, executes setstate(failbit). If an exception was thrown while extracting, sets failbit and, if failbit is set in exceptions(), rethrows the exception.

So your assumption is incorrect. The entire contents of file are copied to the buffer controlled by ss, so reading from ss does not access the filesystem. You can freely read through ss and seek back to the beginning as many times as you like without incurring the overhead of re-reading the file each time.
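To see the eager copy in action, here is a minimal sketch. The helper name slurp and its exhausted out-parameter are made up for illustration; it takes any std::istream so it can be tried with a std::istringstream as well as a std::ifstream.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of any library): copies everything that is
// still readable from `in` into memory and returns it as a string.
// `exhausted` is set to true if `in` has nothing left to read afterwards.
std::string slurp(std::istream& in, bool& exhausted) {
    std::stringstream ss;
    ss << in.rdbuf();  // eager copy; sets failbit on ss if `in` was empty

    // The source has already been drained by the insertion above.
    exhausted = (in.peek() == std::char_traits<char>::eof());

    return ss.str();   // the data now lives in ss's own buffer
}
```

After the call, the source stream has been read to the end and the returned string holds the full content, so later reads from the stringstream never touch the file. Note the corner case from the quoted text: for an empty file, failbit is set on ss because no characters were inserted.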

After the first loop, clear the EOF and fail bits and go back to the beginning of the stringstream with:

ss.clear();
ss.seekg(0, std::ios::beg);
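Put together, the two passes could look like the following sketch. The function name two_passes and what each pass computes (a line count, then a total line length) are placeholders chosen for illustration; substitute whatever work each pass really needs to do.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper: copies `in` into memory once, then walks the lines twice.
// Returns {number of lines seen in pass 1, total line length seen in pass 2}.
std::pair<std::size_t, std::size_t> two_passes(std::istream& in) {
    std::stringstream ss;
    ss << in.rdbuf();               // single read of the underlying source

    std::size_t lines = 0;
    for (std::string line; std::getline(ss, line);) {
        ++lines;                    // first pass
    }

    ss.clear();                     // reset eofbit/failbit left by the last getline
    ss.seekg(0, std::ios::beg);     // rewind the read position

    std::size_t chars = 0;
    for (std::string line; std::getline(ss, line);) {
        chars += line.size();       // second pass, served from memory
    }
    return {lines, chars};
}
```

With a file, this would be called as std::ifstream file(filepath); auto result = two_passes(file);. The order matters: calling seekg while failbit is still set has no effect, so clear() has to come first.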

Am I correct and hence need to come up with another approach?

You're not correct. The "hence" is unwarranted as well. There's not enough info in the question, but I suspect the problem has nothing to do with using a stream buffer.

Without knowing what that first "garbage" character is, I cannot say for sure, but I suspect the file is in a wide-character Unicode format, and you are using access operations that do not work on wide characters. If that is the case, buffering the file has nothing to do with the problem.

As an experiment, try the following. Mind the w's.

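    std::wifstream file(filepath);
    std::wstringstream ss;
    ss << file.rdbuf();  // copy the whole file into ss

    // Print the numeric values of the first 42 non-whitespace characters.
    for (int i = 0; i < 42; ++i) {
        wchar_t ch;
        if (!(ss >> ch)) break;  // stop early if the input runs out
        std::cout << static_cast<unsigned>(ch) << ' ';
    }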

It would not surprise me if the first four numbers are 255 254 92 0, or 255 254 47 0.
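The bytes 255 254 (0xFF 0xFE) are the UTF-16 little-endian byte order mark, which is why those numbers would point to a wide-character file. A rough sketch of checking for it directly on the raw bytes follows; the Bom enum and detect_bom function are names invented for this example, and the check only covers the three most common marks.

```cpp
#include <cstddef>
#include <string>

// Hypothetical classification of the byte order marks this sketch recognizes.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Hypothetical helper: inspects the first bytes of `bytes` (raw file content
// read in binary mode) for a byte order mark.
// Caveat: a UTF-32 LE mark (FF FE 00 00) also starts with FF FE and would be
// reported as Utf16LE by this simplified check.
Bom detect_bom(const std::string& bytes) {
    auto b = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF) {
        return Bom::Utf8;
    }
    if (bytes.size() >= 2 && b(0) == 0xFF && b(1) == 0xFE) {
        return Bom::Utf16LE;
    }
    if (bytes.size() >= 2 && b(0) == 0xFE && b(1) == 0xFF) {
        return Bom::Utf16BE;
    }
    return Bom::None;
}
```

If this reports Utf16LE for the file in question, that would support the guess above that the "garbage" comes from the encoding rather than from buffering.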

This might help: Problem using getline with unicode files
