
Performance of string streams versus file I/O streams in C++

I have to read in a huge text file (>200,000 words) and process each word. I read the entire file into a string and then attach a string stream to it so I can process each word easily. The other approach is to extract each word directly from the file using >> and process it, but comparing the two approaches gives me no advantage in execution time. Isn't it faster to operate on a string in memory than on a file, which needs a system call every time I need a word? Please suggest some ways to improve performance.
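For reference, a minimal sketch of the two approaches being compared (the file name is a placeholder, and "processing" is reduced to counting words):

```cpp
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>

// Approach 1: stream words straight from the file.
std::size_t count_from_file(const char* path)
{
    std::ifstream in(path);
    std::string word;
    std::size_t n = 0;
    while (in >> word) ++n;
    return n;
}

// Approach 2: slurp the whole file into a stringstream, then extract words.
std::size_t count_from_memory(const char* path)
{
    std::ifstream in(path);
    std::stringstream buffer;
    buffer << in.rdbuf();          // one bulk copy of the entire file
    std::string word;
    std::size_t n = 0;
    while (buffer >> word) ++n;
    return n;
}
```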

If you're going to put the data into a stringstream anyway, it's probably a bit faster and easier to copy directly from the input stream to the string stream:

#include <fstream>
#include <sstream>

std::ifstream infile("yourfile.txt");
std::stringstream buffer;

buffer << infile.rdbuf();

The ifstream will use a buffer, however, so while that's probably faster than reading into a string and then creating a stringstream, it may not be any faster than working directly from the input stream.

For performance and minimal copying, this is hard to beat (as long as you have enough memory!):

#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <sstream>

void mapped(const char* fname)
{
  using namespace boost::interprocess;

  //Create a file mapping
  file_mapping m_file(fname, read_only);

  //Map the whole file with read permissions
  mapped_region region(m_file, read_only);

  //Get the address of the mapped region
  void * addr       = region.get_address();
  std::size_t size  = region.get_size();

  // Now you have the underlying data...
  char *data = static_cast<char*>(addr);

  std::stringstream localStream;
  localStream.rdbuf()->pubsetbuf(data, size);

  // now you can do your stuff with the stream
}
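One caveat with the snippet above: the effect of `pubsetbuf` on a `basic_stringbuf` is implementation-defined, so on some standard libraries it does nothing. A portable way to get the same zero-copy benefit is to parse the mapped bytes directly; a sketch using `std::string_view` (the function name is my own, not from the answer):

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>
#include <vector>

// Split a contiguous character buffer into whitespace-separated words
// without copying the underlying data.
std::vector<std::string_view> split_words(std::string_view text)
{
    std::vector<std::string_view> words;
    std::size_t i = 0;
    while (i < text.size()) {
        // skip leading whitespace
        while (i < text.size() && std::isspace(static_cast<unsigned char>(text[i]))) ++i;
        std::size_t start = i;
        // consume one word
        while (i < text.size() && !std::isspace(static_cast<unsigned char>(text[i]))) ++i;
        if (i > start) words.push_back(text.substr(start, i - start));
    }
    return words;
}
```

With the mapped region from above, this would be called as `split_words(std::string_view(data, size))`; the resulting views stay valid as long as the mapping does.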

There is caching involved, so it does not necessarily do a system call each time you extract. Having said that, you may get marginally better performance at parse time by parsing a single contiguous buffer. On the other hand, you are serializing the workload (read entire file, then parse), which can potentially be parallelized (read and parse in parallel).
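Since extraction from an ifstream goes through the stream's internal buffer, one cheap experiment is to enlarge that buffer before opening the file. The 1 MiB size below is an arbitrary choice for illustration, not a value from the answer:

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

std::size_t count_words_big_buffer(const char* path)
{
    std::vector<char> buf(1 << 20);                 // 1 MiB buffer, arbitrary size
    std::ifstream in;
    in.rdbuf()->pubsetbuf(buf.data(), buf.size());  // must be called before open()
    in.open(path);
    std::string word;
    std::size_t n = 0;
    while (in >> word) ++n;
    return n;
}
```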

The string will get reallocated and copied an awful lot of times to accommodate 200,000 words. That's probably what is taking the time.
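If you do build the string yourself, reserving capacity up front (based on the file size) avoids the repeated reallocations this answer describes; a sketch, assuming the file opens successfully:

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Read an entire file into a string with a single up-front allocation.
std::string slurp(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    in.seekg(0, std::ios::end);
    std::string contents;
    contents.reserve(static_cast<std::size_t>(in.tellg())); // one allocation
    in.seekg(0, std::ios::beg);
    contents.assign(std::istreambuf_iterator<char>(in),
                    std::istreambuf_iterator<char>());
    return contents;
}
```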

You should use a rope if you want to create a huge string by appending.

