
Most efficient way to extract certain lines from a text file

I have a log file of variable length which may or may not contain the strings I'm looking for.

Each line has a timestamp and other fields, followed by <parameter>#<value>. I want to check the parameter and extract the value.

The implementation below works but I'm sure there must be a more efficient way to parse the file.

Key points:

  • Most lines are going to be ignored
  • There are approximately 1600 log files, each between 1 and 20 MB
  • Even a small gain per file will be an advantage

NB: each parse function takes a substring and then converts it to an int.

Any ideas would be much appreciated.

ifstream fileReader(logfile.c_str());
string lineIn;
if(fileReader.is_open())
{
    while(fileReader.good())
    {
        getline(fileReader,lineIn);

        if(lineIn.find("value1#") != string::npos)
        {
            parseValue1(lineIn);
        }
        else if(lineIn.find("value2#") != string::npos)
        {
            parseValue2(lineIn);
        }
        else if(lineIn.find("value3#") != string::npos)
        {
            parseValue3(lineIn);
        }
    }
}
fileReader.close();

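First of all, your loop is wrong: it checks the stream state before the read rather than after it, so the body can run once more after getline has already failed. It should be: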

while( getline( fileReader,lineIn ) ) {
}

Second, the lines:

if( fileReader.is_open() )

and

fileReader.close();

are redundant: reading from a stream that failed to open simply fails, and the destructor closes the file. As for speed, I would recommend using a regular expression:

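// One group for the parameter name, one for the digits after '#'.
std::regex reg( "(value1|value2|value3)#(\\d+)" );
std::smatch m;
while( getline( fileReader, lineIn ) ) {
    // Pass the string itself: mixing string::iterator with smatch does not compile.
    if( std::regex_search( lineIn, m, reg ) ) {
        // m[1] is the parameter name, m[2] is the value
        std::cout << "found: " << m[2] << std::endl;
    }
}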

Of course, you would need to adapt the regular expression to your actual line format.

Unfortunately, iostreams are known to be pretty slow. If this does not give you enough performance, consider replacing fstream with FILE* or mmap.
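As a rough illustration of the FILE* route, here is a minimal sketch that reads lines with fgets and tests each one with strstr. The function name countMatchingLines and the 4096-byte line buffer are assumptions made for the example, not part of the original code; whether this is actually faster than ifstream on your system is something to measure.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical helper: counts the lines of an already opened C stream that
// contain `key`. Assumes no line is longer than the buffer; a longer line is
// delivered in several pieces, and a key straddling two pieces is missed.
std::size_t countMatchingLines(std::FILE* fp, const char* key) {
    char line[4096];
    std::size_t matches = 0;
    while (std::fgets(line, sizeof line, fp) != NULL) {
        if (std::strstr(line, key) != NULL) {
            ++matches;
        }
    }
    return matches;
}
```

A caller would open the log with std::fopen(logfile.c_str(), "r"), pass the handle in, and close it with std::fclose afterwards.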

That looks like a lot of repeated searches over the same string, which is not very efficient.

Parse each line properly, in a single pass, instead.

There are three libraries in Boost that might be of help.

Parse the line using a regular expression: http://www.boost.org/doc/libs/1_53_0/libs/regex/doc/html/index.html

Use a tokenizer http://www.boost.org/doc/libs/1_53_0/libs/tokenizer/index.html

For full customization you can always use Spirit. http://www.boost.org/doc/libs/1_53_0/libs/spirit/doc/html/index.html
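Without any library, the single-pass idea can be sketched roughly as follows: look for '#' once, then read the parameter name before it and the digits after it. The helper parseField is hypothetical, and it assumes that the first '#' on a line is the separator, that the name is preceded by whitespace, and that the value is a non-negative integer that fits in an int.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post). Returns true and fills
// `name` and `value` when the line contains <name>#<digits>.
bool parseField(const std::string& line, std::string& name, int& value) {
    const std::size_t hash = line.find('#');
    if (hash == std::string::npos) {
        return false;  // most lines are rejected here after one scan
    }

    // Walk backwards to the whitespace that precedes the parameter name.
    std::size_t start = hash;
    while (start > 0 &&
           !std::isspace(static_cast<unsigned char>(line[start - 1]))) {
        --start;
    }
    if (start == hash) {
        return false;  // nothing before '#'
    }

    // Accumulate the decimal digits that follow '#'.
    std::size_t pos = hash + 1;
    if (pos >= line.size() ||
        !std::isdigit(static_cast<unsigned char>(line[pos]))) {
        return false;
    }
    int parsed = 0;
    while (pos < line.size() &&
           std::isdigit(static_cast<unsigned char>(line[pos]))) {
        parsed = parsed * 10 + (line[pos] - '0');
        ++pos;
    }

    name.assign(line, start, hash - start);
    value = parsed;
    return true;
}
```

The caller then compares name against "value1", "value2" and "value3" once, instead of searching the whole line three times.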

The first step would be to figure out how much of the time is spent in the if(lineIn.find(...)) checks and how much in actually reading the input file.

Measure how long your application runs (you may want to use a selection of log files rather than all of them). Run it a few times in a row to check that you get approximately the same value each time.

Then add:

#if 0
if (lineIn.find(...) ...) 
...
#endif

and compare the time it takes. My guess is that it won't actually make that much of a difference. However, if the searching turns out to be a major component of the time, it may be beneficial to use a more clever search method. There are well-known algorithms for finding a string inside a larger one, such as Boyer-Moore.

I will post back with a couple of benchmarks of "read a file quicker" that I've posted elsewhere. But bear in mind that reading from the hard disk will account for most of the time.

References:

getline while reading a file vs reading whole file and then splitting based on newline character

slightly less relevant, but perhaps interesting:

What is the best efficient way to read millions of integers separated by lines from text file in c++

Your execution bottleneck will be in file I/O.
I suggest that you haul in as much data as possible in one fetch into a buffer. Next, search the buffer for your tokens.

You have to read in the text in order to search it, so you might as well read in as much of the file as you can.

There may be some drawbacks to reading too much data into memory. If the OS can't fit all the data, it may page it out to the hard drive, which makes the technique worthless (unless you want the OS to handle reading the file in chunks).

Once the file is in memory, the choice of search technique is likely to make only a negligible difference to performance.
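A minimal sketch of that approach, assuming each file (1 to 20 MB here) fits comfortably in memory. The names slurp and valuesAfter are hypothetical. Like the original find calls, the search matches the key anywhere in the text, and it assumes the values are non-negative integers that fit in an int.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads the whole file into one string with a single
// read call. Returns an empty string if the file cannot be opened.
std::string slurp(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::binary | std::ios::ate);
    if (!in) {
        return std::string();
    }
    const std::streamoff size = in.tellg();  // opened at the end: position == size
    if (size <= 0) {
        return std::string();
    }
    std::string buffer(static_cast<std::size_t>(size), '\0');
    in.seekg(0);
    in.read(&buffer[0], static_cast<std::streamsize>(size));
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;
}

// Hypothetical helper: collects the integer that follows each occurrence of
// `key` (for example "value1#") in the buffer.
std::vector<int> valuesAfter(const std::string& buffer, const std::string& key) {
    std::vector<int> values;
    if (key.empty()) {
        return values;
    }
    std::size_t pos = 0;
    while ((pos = buffer.find(key, pos)) != std::string::npos) {
        pos += key.size();
        int parsed = 0;
        bool anyDigit = false;
        while (pos < buffer.size() && buffer[pos] >= '0' && buffer[pos] <= '9') {
            parsed = parsed * 10 + (buffer[pos] - '0');
            ++pos;
            anyDigit = true;
        }
        if (anyDigit) {
            values.push_back(parsed);
        }
    }
    return values;
}
```

Calling valuesAfter(slurp(logfile), "value1#") once per key scans the buffer three times; whether that matters next to the cost of the read is something the timing comparison should tell you.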
