Why is my C++ text file parsing script so much slower than my Python script?

Question

I am currently trying to teach myself C++, and I am working on file I/O. I have read through the cplusplus.com tutorial, and am using the basic file I/O techniques I learned there:

std::ifstream  // using this to open a read-only file
std::ofstream  // using this to create an output file
std::getline  // using this to read each line of the file
outputfile << linecontents  // using this to write to the output file

I have an approximately 10MB text file containing the first million primes, which are separated by whitespace, 8 primes to a line. My goal is to write a program which will open the file, read through the contents, and write a new file with one prime number per line. I am using regular expressions to strip the whitespace on the ends of each line, and to replace the whitespace between each number with a single newline character.

The basic algorithm is simple: for each line, I use regular expressions to trim the whitespace at both ends and replace each run of whitespace in the middle with a newline character, then write the resulting string to the output file. I have written the 'same' algorithm in C++ and Python (except that in Python I use the built-in strip() function to remove leading and trailing whitespace), and the Python program is much quicker! I expected the opposite: I would think that a (well-written) C++ program should be lightning fast, and a Python program 10-20 times slower. Whatever optimization Python does behind the scenes, though, is making it far faster than my 'equivalent' C++ program.

My regex searches:

std::tr1::regex rxLeadingTrailingWS("^(\\s)+|(\\s)+$"); //whitespace at beginning or end of string
std::tr1::regex rxWS("(\\s)+"); //whitespace anywhere

My file-parsing code:

void ReWritePrimesFile()
{
    std::ifstream readFile("..//primes1.txt");
    std::ofstream reducedPrimeList("..//newprimelist.txt");
    std::string readout;
    std::string tempLine;

    std::tr1::regex rxLeadingTrailingWS("^(\\s)+|(\\s)+$"); //whitespace at beginning or end of string
    std::tr1::regex rxWS("(\\s)+"); //whitespace anywhere
    std::tr1::cmatch res; //the variable which a regex_search writes its results to

    while (std::getline(readFile, readout)){
        tempLine = std::tr1::regex_replace(readout.c_str(), rxLeadingTrailingWS, ""); //remove leading and trailing whitespace
        reducedPrimeList << std::tr1::regex_replace(tempLine.c_str(), rxWS, "\n") << "\n"; //replace all other whitespace with newlines
    }

    reducedPrimeList.close();
}

However, this code is taking minutes to parse through a 10 MB file. The following Python script takes approx 1-3 seconds (haven't timed it):

import re
rxWS = r'\s+'
with open('pythonprimeoutput.txt', 'w') as newfile:
    with open('primes1.txt', 'r') as f:
        for line in f.readlines():
            newfile.write(re.sub(rxWS, "\n", line.strip()) + "\n")

The only notable difference is that in Python I'm using the built-in strip() function to remove leading and trailing whitespace (including the newline) instead of using a regular expression. (Is this the source of my terribly slow execution time?)
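For comparison, a regex-free counterpart to strip() can be sketched in C++ with std::string's own search functions. This is only a rough equivalent: strip is a made-up helper name, not a standard function, and it assumes plain ASCII whitespace.

```cpp
#include <string>

// Hypothetical helper (not part of the standard library): removes leading and
// trailing ASCII whitespace, roughly like Python's str.strip().
std::string strip(const std::string& s) {
    const char* ws = " \t\n\r\f\v";
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos) {
        return std::string();  // the string is empty or all whitespace
    }
    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}
```

With something like this, the first regex_replace call in the loop could be replaced by strip(readout), leaving only the rxWS replacement.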

I'm not sure at all where the horrible inefficiency in my program is coming from. A 10MB file should not take this long to parse through!

*edited: originally showed the file at 20MB, it's only 10MB.

Per Nathan Oliver's suggestion, I used the following code, which still took about 5 minutes to run. This is now pretty much the same algorithm I used in Python. Still not sure what's different.

void ReWritePrimesFile()
{
    std::ifstream readFile("..//primes.txt");
    std::ofstream reducedPrimeList("..//newprimelist.txt");
    std::string readout;
    std::string tempLine;

    //std::tr1::regex rxLeadingTrailingWS("^(\\s)+|(\\s)+$"); //whitespace at beginning or end of string
    std::tr1::regex rxWS("(\\s)+"); //whitespace anywhere
    std::tr1::cmatch res; //the variable which a regex_search writes its results to

    while (readFile >> readout){
        reducedPrimeList << std::tr1::regex_replace(readout.c_str(), rxWS, "\n") + "\n"; //replace all whitespace with newlines
    }

    reducedPrimeList.close();
}

second edit: I had to add an additional newline character at the end of the regex_replace line. Apparently the readFile >> readout stops at every whitespace character? Not sure how it works, but it runs an iteration of the while loop for each number in the file, not for each line in the file.
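A small sketch of that difference, assuming the usual stream semantics: operator>> on a std::string skips leading whitespace and then reads up to the next whitespace character, so the loop runs once per token, while std::getline reads up to the next newline, so the loop runs once per line. The names count_tokens and count_lines are made up for illustration.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Counts items the way `while (in >> word)` sees them:
// one iteration per whitespace-separated token.
std::size_t count_tokens(const std::string& text) {
    std::istringstream in(text);
    std::string word;
    std::size_t n = 0;
    while (in >> word) {
        ++n;
    }
    return n;
}

// Counts items the way `while (std::getline(in, line))` sees them:
// one iteration per line.
std::size_t count_lines(const std::string& text) {
    std::istringstream in(text);
    std::string line;
    std::size_t n = 0;
    while (std::getline(in, line)) {
        ++n;
    }
    return n;
}
```

For two lines of four primes each, count_tokens gives 8 and count_lines gives 2. Since every token read with >> already contains no whitespace, the regex_replace in the loop above has nothing left to replace.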

Answer 1

The code you have is slower because you are doing two regex calls per line in the C++ code. Also, if you use the >> operator to read from the file, it will skip leading whitespace and then read until the next whitespace character is found, so you do not need a regex at all. You could easily write your function like this:

void ReWritePrimesFile()
{
    std::ifstream readFile("..//primes1.txt");
    std::ofstream reducedPrimeList("..//newprimelist.txt");
    std::string readout;

    while(readFile >> readout)
        reducedPrimeList << readout << '\n';
}
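The same idea can be written against generic streams, which makes it easy to try on a short string before running it on the real file. This is a sketch only; write_one_per_line and one_per_line are hypothetical names, not anything from the standard library.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: copies every whitespace-separated token from `in`
// to `out`, one token per line. No regex is involved.
void write_one_per_line(std::istream& in, std::ostream& out) {
    std::string token;
    while (in >> token) {
        out << token << '\n';
    }
}

// Convenience wrapper for trying the helper on an in-memory string.
std::string one_per_line(const std::string& text) {
    std::istringstream in(text);
    std::ostringstream out;
    write_one_per_line(in, out);
    return out.str();
}
```

To process the actual file, pass a std::ifstream and a std::ofstream opened on the two paths to write_one_per_line. Writing '\n' rather than std::endl avoids flushing the output stream after every number.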

solution1
5 2015-05-12 17:07:29
