How to speed up this code?

Question

I have the following method, which reads a text file and returns a dictionary. It takes ~7 minutes to read a ~5 MB file (67,000 lines, 70 characters per line).

public static Dictionary<string, string> FASTAFileReadIn(string file)
{
    Dictionary<string, string> seq = new Dictionary<string, string>();

    Regex re;
    Match m;
    GroupCollection group;
    string currentName = string.Empty;

    try
    {
        using (StreamReader sr = new StreamReader(file))
        {
            string line = string.Empty;
            while ((line = sr.ReadLine()) != null)
            {
                if (line.StartsWith(">"))
                {// Match Sequence
                    re = new Regex(@"^>(\S+)");
                    m = re.Match(line);
                    if (m.Success)
                    {
                        group = m.Groups;
                        if (!seq.ContainsKey(group[1].Value))
                        {
                            seq.Add(group[1].Value, string.Empty);
                            currentName = group[1].Value;
                        }
                    }
                }
                else if (Regex.Match(line.Trim(), @"\S+").Success &&
                            currentName != string.Empty)
                {
                    seq[currentName] += line.Trim();
                }
            }
        }
    }
    catch (IOException e)
    {
        Console.WriteLine("An IO exception has been thrown!");
        Console.WriteLine(e.ToString());
    }
    finally { }

    return seq;
}

Which part of the code is the most time-consuming, and how can I speed it up?

Thanks

Answer 1

I'd hope the compiler would do this automatically, but the first thing I notice is that you're recompiling the regular expression on every matching line:

            while ((line = sr.ReadLine()) != null)
            {
                if (line.StartsWith(">"))
                {// Match Sequence
                    re = new Regex(@"^>(\S+)");

It's even better if you can remove the regular expressions completely; most languages provide a split function of some sort that often smokes regular expressions.
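As a rough sketch of the regex-free idea, the header name can be extracted with plain string operations. `HeaderParser` and `ExtractName` are hypothetical names, not part of the original code, and the sketch assumes the name is everything between `>` and the first whitespace character, which is what `^>(\S+)` captures.

```csharp
using System;

public static class HeaderParser
{
    // Returns the sequence name from a FASTA header line,
    // or null if the line is not a header or carries no name.
    // (Hypothetical helper; mirrors the capture group of ^>(\S+).)
    public static string ExtractName(string line)
    {
        if (string.IsNullOrEmpty(line) || line[0] != '>')
            return null;

        // Scan forward to the first whitespace character after '>'.
        int end = 1;
        while (end < line.Length && !char.IsWhiteSpace(line[end]))
            end++;

        // end == 1 means '>' was followed by whitespace or nothing.
        return end == 1 ? null : line.Substring(1, end - 1);
    }
}
```

For a line such as `>seq1 some description` this returns `seq1`; a bare `>` or a non-header line returns null, matching the cases where the regex would not match.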

Answer 2

Cache and compile the regular expressions, reorder the conditionals, reduce the number of Trim calls, and so on.

public static Dictionary<string, string> FASTAFileReadIn(string file) {
    var seq = new Dictionary<string, string>();

    Regex re = new Regex(@"^>(\S+)", RegexOptions.Compiled);
    Regex nonWhitespace = new Regex(@"\S", RegexOptions.Compiled);
    Match m;
    string currentName = string.Empty;

    try {
        foreach(string line in File.ReadLines(file)) {
            if(line.Length > 0 && line[0] == '>') { // guard: line[0] throws on an empty line
                m = re.Match(line);

                if(m.Success) {
                    if(!seq.ContainsKey(m.Groups[1].Value)) {
                        seq.Add(m.Groups[1].Value, string.Empty);
                        currentName = m.Groups[1].Value;
                    }
                }
            } else if(currentName != string.Empty) {
                if(nonWhitespace.IsMatch(line)) {
                    seq[currentName] += line.Trim();
                }
            }
        }
    } catch(IOException e) {
        Console.WriteLine("An IO exception has been thrown!");
        Console.WriteLine(e.ToString());
    }

    return seq;
}

However, that's just a naïve optimization. After reading up on the FASTA format, I wrote this:

public static Dictionary<string, string> ReadFasta(string filename) {
    var result = new Dictionary<string, string>();
    var current = new StringBuilder();
    string currentKey = null;

    foreach(string line in File.ReadLines(filename)) {
        if(line.Length > 0 && line[0] == '>') { // guard: line[0] throws on an empty line
            if(currentKey != null) {
                result.Add(currentKey, current.ToString());
                current.Clear();
            }

            int i = line.IndexOf(' ', 2);

            currentKey = i > -1 ? line.Substring(1, i - 1) : line.Substring(1);
        } else if(currentKey != null) {
            current.Append(line.TrimEnd());
        }
    }

    if(currentKey != null)
        result.Add(currentKey, current.ToString());

    return result;
}

Tell me if it works; it should be much faster.
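To check whether a rewrite is actually faster, one option is to time each reader against the same file. The following is only a sketch: `ReaderTiming` and `Time` are hypothetical names, and it assumes the reader has the same shape as the methods above (file path in, `Dictionary<string, string>` out).

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class ReaderTiming
{
    // Runs the given reader once on the given path, returns the elapsed
    // wall-clock time, and reports how many entries the reader produced.
    // (Hypothetical helper for comparing implementations.)
    public static TimeSpan Time(
        Func<string, Dictionary<string, string>> reader,
        string path,
        out int entries)
    {
        Stopwatch watch = Stopwatch.StartNew();
        Dictionary<string, string> result = reader(path);
        watch.Stop();

        entries = result.Count;
        return watch.Elapsed;
    }
}
```

Calling it once with `FASTAFileReadIn` and once with `ReadFasta` on the same file gives two timings and two entry counts to compare; a single run is a coarse measurement, so repeating it a few times is sensible.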

Answer 3

You can improve the reading speed substantially by using a BufferedStream:

using (FileStream fs = File.Open(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
    // Use the StreamReader
}

The Regex recompile @sarnold mentioned is probably your largest performance killer, though, if your processing time is ~7 minutes.

Answer 4

Here's how I would write it. Without more information (i.e. how long the average dictionary entry is) I can't optimize the StringBuilder capacity. You could also follow Eric J.'s advice and add a BufferedStream. Ideally, you'd do away with regular expressions entirely if you want to crank out the performance, but they are a lot easier to write and manage, so I understand why you'd want to use them.

public static Dictionary<string, StringBuilder> FASTAFileReadIn(string file)
{
    var seq = new Dictionary<string, StringBuilder>();
    var regName = new Regex("^>(\\S+)", RegexOptions.Compiled);
    var regAppend = new Regex("\\S+", RegexOptions.Compiled);

    Match tempMatch = null;
    string currentName = string.Empty;
    try
    {
        using (StreamReader sReader = new StreamReader(file))
        {
            string line = string.Empty;
            while ((line = sReader.ReadLine()) != null)
            {
                if ((tempMatch = regName.Match(line)).Success)
                {
                    if (!seq.ContainsKey(tempMatch.Groups[1].Value))
                    {
                        currentName = tempMatch.Groups[1].Value;
                        seq.Add(currentName, new StringBuilder());
                    }
                }
                else if ((tempMatch = regAppend.Match(line)).Success && currentName != string.Empty)
                {
                    seq[currentName].Append(tempMatch.Value);
                }
            }
        }
    }
    catch (IOException e)
    {
        Console.WriteLine("An IO exception has been thrown!");
        Console.WriteLine(e.ToString());
    }

    return seq;
}

As you can see, I've slightly changed your dictionary to use the optimized StringBuilder class for appending values. I've also pre-compiled the regular expressions once and once only to ensure that you aren't redundantly recompiling the same regular expression over and over again. I've also extracted your "append" case to compile into a Regular Expression as well.

Let me know if this helps you out performance-wise.
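Because this version returns `Dictionary<string, StringBuilder>` rather than `Dictionary<string, string>`, existing callers may need the values converted back. A minimal sketch of that conversion, with `SequenceMaps` and `ToStrings` as hypothetical names:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class SequenceMaps
{
    // Materializes each StringBuilder into a string exactly once,
    // after all appends for that entry are finished.
    // (Hypothetical helper, not part of the original method.)
    public static Dictionary<string, string> ToStrings(
        Dictionary<string, StringBuilder> builders)
    {
        return builders.ToDictionary(
            pair => pair.Key,
            pair => pair.Value.ToString());
    }
}
```

Converting once at the end keeps the cheap appends during reading and still hands callers the original return type.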

solution1
3 2012-07-24 03:08:42

solution2
2 ACCEPTED 2012-07-24 03:14:27

solution3
1 2012-07-24 03:10:49

solution4
1 2012-07-24 03:32:38