How can I optimize the performance of this regular expression?

Question

I'm using a regular expression to replace commas that are not inside text-qualifying quotes with tab characters. I'm running the regex on file content through a script task in SSIS. The file content is over 6000 lines long. I saw an example of using a regex on file content that looked like this:

String FileContent = ReadFile(FilePath, ErrInfo);        
Regex r = new Regex(@"(,)(?=(?:[^""]|""[^""]*"")*$)");
FileContent = r.Replace(FileContent, "\t");

That replace can understandably take its sweet time on a decent-sized file.

Is there a more efficient way to run this regex? Would it be faster to read the file line by line and run the regex per line?

Answer 1

It seems you're trying to convert comma separated values (CSV) into tab separated values (TSV).

In this case, you should try to find a CSV library instead and read the fields with that library (and convert them to TSV if necessary).

Alternatively, you can check whether each line has quotes and use a simpler method accordingly.
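
The second suggestion could look roughly like the sketch below. A line that contains no quote character cannot have a protected comma, so a plain character replace is enough, and only the remaining lines go through the regex from the question. The names CsvToTsv and ConvertLine are made up for illustration, and the sketch assumes a quoted field never spans more than one line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CsvToTsv
{
    // Regex from the question: a comma followed only by balanced quotes up to the end.
    private static readonly Regex QuoteAware =
        new Regex(@"(,)(?=(?:[^""]|""[^""]*"")*$)", RegexOptions.Compiled);

    // Hypothetical helper: converts one line, assuming quoted fields do not span lines.
    public static string ConvertLine(string line)
    {
        // No quote on the line means every comma is a delimiter.
        if (line.IndexOf('"') < 0)
            return line.Replace(',', '\t');

        // Otherwise fall back to the quote-aware regex.
        return QuoteAware.Replace(line, "\t");
    }
}
```

How much this helps depends on how many lines actually contain quotes; if most of them do, the regex still does most of the work.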

Answer 2

The problem is the lookahead, which scans all the way to the end of the input for each comma, resulting in O(n²) complexity, which is noticeable on long inputs. You can get it done in a single pass by skipping over quoted strings while replacing:

Regex csvRegex = new Regex(@"
    (?<Quoted>
        ""                  # Open quotes
        (?:[^""]|"""")*     # not quotes, or two quotes (escaped)
        ""                  # Closing quotes
    )
    |                       # OR
    (?<Comma>,)             # A comma
    ",
RegexOptions.IgnorePatternWhitespace);
content = csvRegex.Replace(content,
                        match => match.Groups["Comma"].Success ? "\t" : match.Value);

Here we match free commas and quoted strings. The Replace method takes a callback that checks whether the match was a comma: if so, it returns a tab; otherwise it returns the quoted string unchanged.
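
To see the callback at work on a concrete input, the same pattern can be wrapped in a small helper, as in the sketch below. The class and method names (SinglePassCsv, ToTabs) are made up for illustration; the pattern itself is the one from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglePassCsv
{
    // Quoted strings are matched as a whole and kept; bare commas are matched and replaced.
    private static readonly Regex CsvRegex = new Regex(@"
        (?<Quoted>
            ""                  # Open quotes
            (?:[^""]|"""")*     # not quotes, or two quotes (escaped)
            ""                  # Closing quotes
        )
        |                       # OR
        (?<Comma>,)             # A comma
        ", RegexOptions.IgnorePatternWhitespace);

    // Hypothetical wrapper around the Replace call shown in the answer.
    public static string ToTabs(string content)
    {
        return CsvRegex.Replace(content,
            match => match.Groups["Comma"].Success ? "\t" : match.Value);
    }
}
```

For example, SinglePassCsv.ToTabs applied to the text 1,"a,""b"",c",2 replaces only the two commas outside the quoted field and leaves "a,""b"",c" untouched. Because a quoted string is consumed as one match, this should also cope with quoted fields that contain line breaks when run on the whole file content.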

Answer 3

The simplest optimization would be to compile the regex and apply it line by line:

Regex r = new Regex(@"(,)(?=(?:[^""]|""[^""]*"")*$)", RegexOptions.Compiled);
foreach (var line in System.IO.File.ReadAllLines("input.txt"))
    Console.WriteLine(r.Replace(line, "\t"));

I haven't profiled it, but I wouldn't be surprised if the speedup was huge.
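
Since the claim above is unmeasured, a rough way to check it is a small Stopwatch harness like the sketch below. Everything in it (the RegexTiming class, its method names, and the shape of the synthetic rows) is an assumption made for illustration, and it only gives comparable output when no quoted field spans a line break.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexTiming
{
    private static readonly Regex R =
        new Regex(@"(,)(?=(?:[^""]|""[^""]*"")*$)", RegexOptions.Compiled);

    // Whole content in one call: each lookahead may scan to the end of the content.
    public static string WholeContent(string[] lines)
    {
        return R.Replace(string.Join("\n", lines), "\t");
    }

    // One call per line: each lookahead only scans to the end of its own line.
    public static string PerLine(string[] lines)
    {
        return string.Join("\n", lines.Select(l => R.Replace(l, "\t")).ToArray());
    }

    // Returns elapsed milliseconds for one run of the given conversion.
    public static long Time(Func<string[], string> convert, string[] lines)
    {
        var sw = Stopwatch.StartNew();
        convert(lines);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    // Synthetic input; the row shape is an arbitrary choice for this sketch.
    public static string[] MakeLines(int count)
    {
        return Enumerable.Range(0, count)
            .Select(i => i + ",\"Smith, John\",plain")
            .ToArray();
    }
}
```

Comparing RegexTiming.Time(RegexTiming.WholeContent, lines) with RegexTiming.Time(RegexTiming.PerLine, lines) for lines = RegexTiming.MakeLines(6000) should show whether the per-line version is worth it on a given machine; a single run is noisy, so repeating the measurement a few times is advisable.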

If that's not enough, I suggest some manual labour:

char[] toMatch = ",\"".ToCharArray();

// Dispose the reader once the file has been processed.
using (var input = new StreamReader(File.OpenRead("input.txt")))
{
    string line;
    while (null != (line = input.ReadLine()))
    {
        var result = new StringBuilder(line);
        bool inquotes = false;

        // Jump straight to the next comma or quote; everything else is skipped.
        for (int index = 0; -1 != (index = line.IndexOfAny(toMatch, index)); index++)
        {
            bool isquote = (line[index] == '"');
            inquotes = inquotes != isquote;   // toggle on every quote

            if (!(isquote || inquotes))       // a comma outside quotes
                result[index] = '\t';
        }
        Console.WriteLine(result);
    }
}

PS: I assumed @"\t" was a typo for "\t", but perhaps it isn't :)

3 answers

solution1
6 2011-07-08 18:24:35

solution2
4 ACCEPTED 2011-07-08 18:37:59

solution3
2 2011-07-08 19:20:20
