
How to speed up this code?

I have the following method, which reads a text file and returns a dictionary. It takes about 7 minutes to read a ~5 MB file (67,000 lines, 70 characters per line).

public static Dictionary<string, string> FASTAFileReadIn(string file)
{
    Dictionary<string, string> seq = new Dictionary<string, string>();

    Regex re;
    Match m;
    GroupCollection group;
    string currentName = string.Empty;

    try
    {
        using (StreamReader sr = new StreamReader(file))
        {
            string line = string.Empty;
            while ((line = sr.ReadLine()) != null)
            {
                if (line.StartsWith(">"))
                {// Match Sequence
                    re = new Regex(@"^>(\S+)");
                    m = re.Match(line);
                    if (m.Success)
                    {
                        group = m.Groups;
                        if (!seq.ContainsKey(group[1].Value))
                        {
                            seq.Add(group[1].Value, string.Empty);
                            currentName = group[1].Value;
                        }
                    }
                }
                else if (Regex.Match(line.Trim(), @"\S+").Success &&
                            currentName != string.Empty)
                {
                    seq[currentName] += line.Trim();
                }
            }
        }
    }
    catch (IOException e)
    {
        Console.WriteLine("An IO exception has been thrown!");
        Console.WriteLine(e.ToString());
    }
    finally { }

    return seq;
}

Which part of the code is the most time-consuming, and how can I speed it up?

Thanks

I'd hope the compiler would do this automatically, but the first thing I notice is that you're recompiling the regular expression on every matching line:

            while ((line = sr.ReadLine()) != null)
            {
                if (line.StartsWith(">"))
                {// Match Sequence
                    re = new Regex(@"^>(\S+)");

Even better if you can remove the regular expressions completely; most languages provide a split function of some sort that is often much faster than regular expressions...
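As a rough illustration of the split-based approach, here is a minimal sketch of extracting the sequence name from a header line without a regular expression. The class and method names (`FastaHeader`, `ExtractName`) are hypothetical, and the sketch assumes that only spaces and tabs separate the name from the description, which is narrower than what `\S` covers.

```csharp
using System;

public static class FastaHeader
{
    // Assumption: only space and tab act as separators in header lines.
    private static readonly char[] Whitespace = { ' ', '\t' };

    // Hypothetical helper: returns the text between '>' and the first
    // space or tab, or null when the line is not a usable header.
    public static string ExtractName(string line)
    {
        if (string.IsNullOrEmpty(line) || line[0] != '>')
            return null;

        // Split into at most two pieces; the first piece is the name.
        string name = line.Substring(1).Split(Whitespace, 2)[0];

        // Like ^>(\S+), reject headers with nothing directly after '>'.
        return name.Length > 0 ? name : null;
    }
}
```

Calling `FastaHeader.ExtractName(line)` in place of the `re.Match(line)` step would remove the regular expression from the header branch entirely.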

Cache and compile the regular expressions, reorder the conditionals, reduce the number of trims, and so on.

public static Dictionary<string, string> FASTAFileReadIn(string file) {
    var seq = new Dictionary<string, string>();

    Regex re = new Regex(@"^>(\S+)", RegexOptions.Compiled);
    Regex nonWhitespace = new Regex(@"\S", RegexOptions.Compiled);
    Match m;
    string currentName = string.Empty;

    try {
        foreach(string line in File.ReadLines(file)) {
            if(line.Length > 0 && line[0] == '>') {
                m = re.Match(line);

                if(m.Success) {
                    if(!seq.ContainsKey(m.Groups[1].Value)) {
                        seq.Add(m.Groups[1].Value, string.Empty);
                        currentName = m.Groups[1].Value;
                    }
                }
            } else if(currentName != string.Empty) {
                if(nonWhitespace.IsMatch(line)) {
                    seq[currentName] += line.Trim();
                }
            }
        }
    } catch(IOException e) {
        Console.WriteLine("An IO exception has been thrown!");
        Console.WriteLine(e.ToString());
    }

    return seq;
}

However, that's just a naïve optimization. After reading up on the FASTA format, I wrote this:

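// Requires: using System.Collections.Generic; using System.IO; using System.Text;
public static Dictionary<string, string> ReadFasta(string filename) {
    var result = new Dictionary<string, string>();
    var current = new StringBuilder();
    string currentKey = null;

    foreach(string line in File.ReadLines(filename)) {
        // Skip blank lines so that line[0] cannot throw.
        if(line.Length == 0)
            continue;

        if(line[0] == '>') {
            // A new header: store the sequence collected for the previous one.
            // Note that Add throws if the same name appears twice in the file.
            if(currentKey != null) {
                result.Add(currentKey, current.ToString());
                current.Clear();
            }

            // The key is the text between '>' and the first space, if any.
            int i = line.Length > 2 ? line.IndexOf(' ', 2) : -1;

            currentKey = i > -1 ? line.Substring(1, i - 1) : line.Substring(1);
        } else if(currentKey != null) {
            current.Append(line.TrimEnd());
        }
    }

    // Store the last sequence in the file.
    if(currentKey != null)
        result.Add(currentKey, current.ToString());

    return result;
}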

Tell me if it works; it should be much faster.

You can improve the reading speed substantially by using a BufferedStream:

using (FileStream fs = File.Open(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
    // Use the StreamReader
}

The Regex recompilation that @sarnold mentioned is probably your biggest performance killer, though, if your processing time is ~7 minutes.
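To check which of these suggestions actually matters for a given file, it may help to measure rather than guess. The following is a minimal sketch, not a rigorous benchmark: the class name `RegexTiming` and its methods are hypothetical, and it only compares constructing a `Regex` per line against reusing one instance.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class RegexTiming
{
    // Counts header lines, constructing a new Regex for every line
    // (the pattern used in the question).
    public static int CountWithNewRegexPerLine(string[] lines)
    {
        int count = 0;
        foreach (string line in lines)
        {
            var re = new Regex(@"^>(\S+)");
            if (re.Match(line).Success) count++;
        }
        return count;
    }

    // Counts header lines, reusing a single Regex instance.
    public static int CountWithCachedRegex(string[] lines)
    {
        var re = new Regex(@"^>(\S+)");
        int count = 0;
        foreach (string line in lines)
        {
            if (re.Match(line).Success) count++;
        }
        return count;
    }

    // Returns the wall-clock time taken by one call to action.
    public static TimeSpan Time(Action action)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed;
    }
}
```

For example, after loading the file with `File.ReadAllLines`, `RegexTiming.Time(() => RegexTiming.CountWithNewRegexPerLine(lines))` and the cached counterpart can be compared directly. The same `Time` helper can wrap the string concatenation step to see how much it contributes.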

Here's how I would write it. Without more information (i.e. how long the average dictionary entry is) I can't optimize the StringBuilder capacity. You could also follow Eric J.'s advice and add a BufferedStream. Ideally, you'd do away with regular expressions entirely if you want to crank out the performance, but they are a lot easier to write and manage, so I understand why you'd want to use them.

public static Dictionary<string, StringBuilder> FASTAFileReadIn(string file)
{
    var seq = new Dictionary<string, StringBuilder>();
    var regName = new Regex("^>(\\S+)", RegexOptions.Compiled);
    var regAppend = new Regex("\\S+", RegexOptions.Compiled);

    Match tempMatch = null;
    string currentName = string.Empty;
    try
    {
        using (StreamReader sReader = new StreamReader(file))
        {
            string line = string.Empty;
            while ((line = sReader.ReadLine()) != null)
            {
                if ((tempMatch = regName.Match(line)).Success)
                {
                    if (!seq.ContainsKey(tempMatch.Groups[1].Value))
                    {
                        currentName = tempMatch.Groups[1].Value;
                        seq.Add(currentName, new StringBuilder());
                    }
                }
                else if ((tempMatch = regAppend.Match(line)).Success && currentName != string.Empty)
                {
                    seq[currentName].Append(tempMatch.Value);
                }
            }
        }
    }
    catch (IOException e)
    {
        Console.WriteLine("An IO exception has been thrown!");
        Console.WriteLine(e.ToString());
    }

    return seq;
}

As you can see, I've slightly changed your dictionary to use the optimized StringBuilder class for appending values. I've also pre-compiled the regular expressions once and only once, so that you aren't redundantly recompiling the same regular expression over and over again. I've also extracted your "append" case into a compiled regular expression as well.
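Because the method above now returns StringBuilder values, callers that still need plain strings could convert once at the end instead of on every append. A minimal sketch, assuming a hypothetical helper named `ToStrings` that is not part of the code above:

```csharp
using System.Collections.Generic;
using System.Text;

public static class FastaResult
{
    // Hypothetical helper: materializes each StringBuilder exactly once,
    // after all lines of the file have been appended.
    public static Dictionary<string, string> ToStrings(
        Dictionary<string, StringBuilder> builders)
    {
        var result = new Dictionary<string, string>(builders.Count);

        foreach (KeyValuePair<string, StringBuilder> pair in builders)
            result.Add(pair.Key, pair.Value.ToString());

        return result;
    }
}
```

This keeps the original `Dictionary<string, string>` shape for existing callers, for example `FastaResult.ToStrings(FASTAFileReadIn(file))`.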

Let me know if this helps you out performance-wise.
