How can I optimize the performance of this regular expression?

I'm using a regular expression to replace commas that are not enclosed in text-qualifying quotes with tabs. I'm running the regex on file content through a script task in SSIS. The file content is over 6000 lines long. I saw an example of using a regex on file content that looked like this:

String FileContent = ReadFile(FilePath, ErrInfo);        
Regex r = new Regex(@"(,)(?=(?:[^""]|""[^""]*"")*$)");
FileContent = r.Replace(FileContent, "\t");

That replace can understandably take its sweet time on a decent-sized file.

Is there a more efficient way to run this regex? Would it be faster to read the file line by line and run the regex per line?

It seems you're trying to convert comma-separated values (CSV) into tab-separated values (TSV).

In this case, you should try to find a CSV library instead and read the fields with that library (and convert them to TSV if necessary).
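As one possible illustration of the library route, here is a minimal sketch using `TextFieldParser` from `Microsoft.VisualBasic.FileIO`, which ships with .NET. It assumes the `Microsoft.VisualBasic` assembly is referenced in the script task; the class and method names (`CsvLibrarySketch`, `ToTsv`) are made up for this example. Note that the parser returns field values without their surrounding quotes, so the output differs from the regex approach, and fields that themselves contain tabs or newlines would need extra handling.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO; // assumption: Microsoft.VisualBasic is referenced

static class CsvLibrarySketch
{
    // Hypothetical helper: parses CSV text and joins each record's fields with tabs.
    // Quotes around fields are removed by the parser; tabs inside fields are not escaped.
    public static string ToTsv(string csv)
    {
        var output = new StringWriter();
        using (var parser = new TextFieldParser(new StringReader(csv)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            while (!parser.EndOfData)
            {
                string[] fields = parser.ReadFields();
                output.WriteLine(string.Join("\t", fields));
            }
        }
        return output.ToString();
    }
}
```

Calling `CsvLibrarySketch.ToTsv(FileContent)` would then stand in for the regex replace, provided losing the quotes is acceptable downstream.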

Alternatively, you can check whether each line has quotes and use a simpler method accordingly.
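A rough sketch of that idea, assuming the input is processed one line at a time: lines without any quote character get a plain character replace, and only lines that contain quotes go through the regex from the question. The names `QuoteAwareReplace` and `ConvertLine` are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuoteAwareReplace
{
    // The lookahead regex from the question, used only for lines that contain quotes.
    private static readonly Regex QuotedLineRegex =
        new Regex(@"(,)(?=(?:[^""]|""[^""]*"")*$)", RegexOptions.Compiled);

    // Hypothetical helper: converts one CSV line to tab-separated form.
    public static string ConvertLine(string line)
    {
        // No quotes on this line: every comma is a separator, so skip the regex.
        if (line.IndexOf('"') < 0)
            return line.Replace(',', '\t');

        // Quotes present: fall back to the quote-aware regex.
        return QuotedLineRegex.Replace(line, "\t");
    }
}
```

How much this helps depends on how many lines actually contain quotes; if most do, the regex cost dominates again.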

The problem is the lookahead, which scans all the way to the end of the input for each comma, resulting in O(n²) complexity, which is noticeable on long inputs. You can get it done in a single pass by skipping over quoted strings while replacing:

Regex csvRegex = new Regex(@"
    (?<Quoted>
        ""                  # Open quotes
        (?:[^""]|"""")*     # not quotes, or two quotes (escaped)
        ""                  # Closing quotes
    )
    |                       # OR
    (?<Comma>,)             # A comma
    ",
RegexOptions.IgnorePatternWhitespace);
content = csvRegex.Replace(content,
                        match => match.Groups["Comma"].Success ? "\t" : match.Value);

Here we match free commas and quoted strings. The Replace method takes a callback that checks whether the match was a comma and replaces accordingly: a comma becomes a tab, and a quoted string is returned unchanged.
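To see how the single-pass pattern treats escaped (doubled) quotes, here is a small self-contained check. It wraps the same pattern in a hypothetical `SinglePassDemo.ToTabs` helper; the wrapper and the sample input are assumptions for illustration, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

static class SinglePassDemo
{
    private static readonly Regex CsvRegex = new Regex(@"
        (?<Quoted> "" (?:[^""]|"""")* "" )   # a quoted string, doubled quotes allowed
        |
        (?<Comma>,)                          # a comma outside quotes
        ", RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: replaces commas outside quoted strings with tabs.
    public static string ToTabs(string content)
    {
        return CsvRegex.Replace(content,
            match => match.Groups["Comma"].Success ? "\t" : match.Value);
    }
}
```

For the input `a,"b,""c"",d",e`, the quoted field (including its doubled quotes and inner commas) is matched as one unit and left alone, so only the two outer commas become tabs.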

The simplest optimization would be to compile the regex and run it on one line at a time:

Regex r = new Regex(@"(,)(?=(?:[^""]|""[^""]*"")*$)", RegexOptions.Compiled);
foreach (var line in System.IO.File.ReadAllLines("input.txt"))
    Console.WriteLine(r.Replace(line, "\t"));

I haven't profiled it, but I wouldn't be surprised if the speedup was huge.

If that's not enough, I suggest some manual labour:

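using System;
using System.IO;
using System.Text;

char[] toMatch = ",\"".ToCharArray();
string line;

// Dispose the reader when done; each line is processed independently,
// so quoted fields that span several lines are not handled.
using (var input = new StreamReader("input.txt"))
{
    while (null != (line = input.ReadLine()))
    {
        var result = new StringBuilder(line);
        bool inquotes = false;

        // Jump from one comma or quote to the next instead of visiting every character.
        for (int index = 0; -1 != (index = line.IndexOfAny(toMatch, index)); index++)
        {
            bool isquote = (line[index] == '"');
            inquotes = inquotes != isquote;   // toggle the state on every quote

            // A comma outside quotes is a separator: replace it with a tab.
            if (!(isquote || inquotes))
                result[index] = '\t';
        }
        Console.WriteLine(result);
    }
}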

PS: I assumed @"\t" was a typo for "\t", but perhaps it isn't :)
