
process files in parallel C#

I have this code, which reads all the words from a set of files, assigns an ID to each unique word, and adds it to a Dictionary. I need to make it run in parallel to increase the efficiency of the application. I have tried using Parallel.ForEach instead of foreach; however, taking a lock to add each new word and ID to the Dictionary does not increase efficiency at all... Could you guys help me by suggesting the best ways to parallelize this code?

    //static object locker = new object();
    string[] fnames; // Files are collected from a save file dialog
    Dictionary<string, IndexEntry> ID = new Dictionary<string, IndexEntry>();
    foreach (var fname in fnames)
    {
        string[] lines = File.ReadAllLines(fname);
        for (int i = 0; i < lines.Length; i++)
        {
            string[] Raw = Regex.Split(lines[i], @"\W+");

            for (int j = 0; j < Raw.Length; j++)
            {
                string z = Raw[j];

                if (!ID.ContainsKey(z))
                {
                    ID.Add(z, new IndexEntry());
                }
            }
        }
    }

The Producer/Consumer pattern is your friend here.

You can have one thread reading the file, a second thread inserting into the dictionary, and potentially a third thread doing whatever processing needs to happen. The third thread only applies if the dictionary does not have to be fully populated before that processing can begin (e.g. if it is sufficient for a given line to have been read).

Note that, if the processing step is trivial, your gains will be minimal vs. a single-threaded solution.

Check out the Task Parallel Library; it is ideally suited to this type of processing.

I use this pattern for reading, processing and writing (to a DB) rather large (1GB+) XML documents.
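A minimal sketch of that pattern using the TPL's BlockingCollection<T> (the IndexEntry type is borrowed from the question; the bounded capacity is an arbitrary assumption): one producer task reads the files and posts words, while the consumer is the only thread touching the dictionary, so the dictionary itself needs no lock.

    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;

    class IndexEntry { } // stand-in for the question's type

    static class ProducerConsumerIndexer
    {
        static readonly Regex rxWord = new Regex(@"\w+");

        public static Dictionary<string, IndexEntry> BuildIndex(IEnumerable<string> fnames)
        {
            // Bounded so a fast reader cannot run arbitrarily far ahead of the consumer.
            var words = new BlockingCollection<string>(boundedCapacity: 10000);

            // Producer: reads the files line by line and posts each word.
            var producer = Task.Run(() =>
            {
                try
                {
                    foreach (var fname in fnames)
                        foreach (var line in File.ReadLines(fname))
                            for (var m = rxWord.Match(line); m.Success; m = m.NextMatch())
                                words.Add(m.Value);
                }
                finally
                {
                    words.CompleteAdding(); // tell the consumer no more words are coming
                }
            });

            // Consumer: the only thread that touches the dictionary, so no lock is needed.
            var index = new Dictionary<string, IndexEntry>();
            foreach (var word in words.GetConsumingEnumerable())
                if (!index.ContainsKey(word))
                    index.Add(word, new IndexEntry());

            producer.Wait(); // surface any I/O exceptions thrown while reading
            return index;
        }
    }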

If this block of code is accessed by multiple threads, I'd first consider a ConcurrentDictionary, which is thread-safe. It implements the locking for you.

EDIT:

http://msdn.microsoft.com/en-us/library/dd287191%28v=vs.110%29.aspx
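For instance, the check-then-add from the question collapses into a single atomic GetOrAdd call. A sketch, reusing the question's IndexEntry type and Regex.Split tokenization:

    using System.Collections.Concurrent;
    using System.IO;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;

    class IndexEntry { } // stand-in for the question's type

    static class ConcurrentIndexer
    {
        public static ConcurrentDictionary<string, IndexEntry> BuildIndex(string[] fnames)
        {
            var id = new ConcurrentDictionary<string, IndexEntry>();

            Parallel.ForEach(fnames, fname =>
            {
                foreach (var line in File.ReadLines(fname))
                    foreach (var word in Regex.Split(line, @"\W+"))
                        // Atomically inserts only if the key is absent -- no explicit lock.
                        id.GetOrAdd(word, _ => new IndexEntry());
            });

            return id;
        }
    }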

The problem is that your biggest time consumer is reading the file:

string[] lines = File.ReadAllLines(fname);

You're slurping the whole file in at one fell swoop. You might have a thread for each file, but I don't think that buys you much, since their I/O is all contending for the same disk. Try doing it in smaller pieces. Something like this might do:

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

static Dictionary<string,IndexEntry> ProcessFiles( IEnumerable<string> filenames )
{
  IEnumerable<string> words = filenames
                              .AsParallel()
                            //.WithMergeOptions( ParallelMergeOptions.NotBuffered )
                              .Select( x => ReadWordsFromFile(x) )
                              .SelectMany( x => x )
                              ;

  Dictionary<string,IndexEntry> index = new Dictionary<string,IndexEntry>() ;
  foreach( string word in words ) // would making this parallel speed things up? dunno.
  {
    bool found = index.ContainsKey(word) ;
    if ( !found )
    {
      index.Add( word, new IndexEntry() ) ;
    }
  }
  return index ;
}

static Regex rxWord = new Regex( @"\w+" ) ;
private static IEnumerable<string> ReadWordsFromFile( string fn )
{
  using( StreamReader sr = File.OpenText( fn ) )
  {
    string line ;
    while ( (line=sr.ReadLine()) != null )
    {
      for ( Match m = rxWord.Match(line) ; m.Success ; m = m.NextMatch() )
      {
        yield return m.Value ;
      }
    }
  }
}
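A hypothetical call site (the file names below are made up; in the question they come from a save-file dialog):

    // Hypothetical usage; the file names below are assumed sample paths.
    string[] fnames = { "a.txt", "b.txt" };
    Dictionary<string,IndexEntry> index = ProcessFiles( fnames );
    Console.WriteLine( "{0} unique words indexed.", index.Count );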

