Using AsParallel() and/or Parallel.ForEach on a virtual machine

Question

Our web app is hosted on a virtual machine with 8 vCPUs. We have an intensive data operation that runs on a nightly schedule (console app / Windows Task Scheduler) which I'd like to parallelize somehow. The operation iterates many times over many data sets to calculate different statistics. Currently when it runs, Task Manager shows that its CPU usage never goes above 13%.

Here is code from one of the methods that gets called (the web app is a large questionnaire):

Dictionary<string, List<decimal>> decimalStats = new Dictionary<string, List<decimal>>();

using (var db = new PDBContext())
{
    IEnumerable<FinancialYear> financialYears = db.FinancialYears;
    IEnumerable<Section> sections;
    IEnumerable<Question> questions;

    IQueryable<int> orgIds = db.Organisations.Where(l => l.Sector.IndustryID == 1).Select(m => m.OrganisationID);
    IQueryable<int> subSectionIds;

    foreach (var financialYear in financialYears)
    {
        sections = db.Sections.Where(l => orgIds.Contains(l.OrganisationID) && l.FinancialYearID == financialYear.FinancialYearID && l.IsVerified.Value);

        foreach (var section in sections)
        {
            subSectionIds = db.SubSections.Where(l => l.SectionID == section.SectionID).Select(m => m.SubSectionID);                
            questions = db.Questions.Where(l => subSectionIds.Contains(l.SubSectionID.Value));

            foreach (var question in questions)
            {
                var answer = db.Answers.Where(l => l.QuestionID == question.QuestionID && l.OrganisationID == section.OrganisationID && l.FinancialYearID == financialYear.FinancialYearID).FirstOrDefault();

                if (answer != null)
                {
                    string key = question.QuestionID + "#" + financialYear.FinancialYearID;

                    decimal val;
                    if (decimal.TryParse(answer.Text, out val))
                    {
                        if (decimalStats.ContainsKey(key))
                        {
                            ((List<decimal>)decimalStats[key]).Add(val);
                        }
                        else
                        {
                            List<decimal> vals = new List<decimal>();
                            vals.Add(val);
                            decimalStats.Add(key, vals);
                        }
                    }
                }
            }
        }
    }

    foreach (KeyValuePair<string, List<decimal>> entry in decimalStats)
    {
        List<decimal> vals = ((List<decimal>)entry.Value).OrderBy(l => l).ToList();

        if (vals.Count > 0)
        {
            // lots of stuff to calculate various statistics about the data
        }
    }
}

I have simplified the code above a lot. I hope it isolates the area/s in which I can make use of some parallel execution.

I've tried different combinations of using:

IEnumerable<FinancialYear> financialYears = db.FinancialYears.AsParallel();

Parallel.ForEach(financialYears, financialYear => { });

sections = db.Sections.Where(l => orgIds.Contains(l.OrganisationID) && l.FinancialYearID == financialYear.FinancialYearID && l.IsVerified.Value).AsParallel();

...but nothing I do pushes CPU usage above 13% and the time taken to execute the method stays pretty much the same. What trick am I missing here? Parallel programming is new to me so I'm trying to make use of PLINQ/TPL as simply as possible.

Answer 1

The bottleneck is most likely the database querying rather than the CPU.

Instead of trying to parallelize the CPU operations, I would recommend focusing on minimizing the number of queries and maximizing the amount of data that each query returns.

For example, this line:

var answer = db.Answers.Where(l => l.QuestionID == question.QuestionID && l.OrganisationID == section.OrganisationID && l.FinancialYearID == financialYear.FinancialYearID).FirstOrDefault();

is probably a performance problem, because it hits the database once for every combination of year, section and question, which adds up to a lot of round trips. You should prefer preloading everything into memory with a single query and then working with the in-memory data.
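To illustrate the in-memory part, here is a minimal sketch. It assumes the answers have already been fetched by one query (filtered to the relevant organisations and verified sections, then materialised with something like ToList()) and that there is at most one answer per question, organisation and year. AnswerRow, StatsBuilder and BuildDecimalStats are hypothetical names made up for this example, not part of the original code; the grouping key is the same "QuestionID#FinancialYearID" string the question uses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one row of the Answers table, already loaded into memory.
public class AnswerRow
{
    public int QuestionID { get; set; }
    public int OrganisationID { get; set; }
    public int FinancialYearID { get; set; }
    public string Text { get; set; }
}

public static class StatsBuilder
{
    // Groups every parseable answer value under "QuestionID#FinancialYearID"
    // in a single pass, without touching the database again.
    public static Dictionary<string, List<decimal>> BuildDecimalStats(IEnumerable<AnswerRow> answers)
    {
        var stats = new Dictionary<string, List<decimal>>();

        foreach (var answer in answers)
        {
            decimal val;
            if (!decimal.TryParse(answer.Text, out val))
            {
                continue; // skip non-numeric answers, as the original loop does
            }

            string key = answer.QuestionID + "#" + answer.FinancialYearID;

            List<decimal> vals;
            if (!stats.TryGetValue(key, out vals))
            {
                vals = new List<decimal>();
                stats.Add(key, vals);
            }
            vals.Add(val);
        }

        return stats;
    }
}
```

With the rows in memory, the three nested loops and the per-question query collapse into one pass over a list. How the single query is written depends on your model, so that part is left out here.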

Also, I forgot to mention: before you attempt any kind of performance optimization, you should profile your code. That way you know whether the problem is I/O bound or algorithmic, which dictates how you should optimize it.
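If you don't have a profiler at hand, a crude first check is to time the query phase and the calculation phase separately. The sketch below uses System.Diagnostics.Stopwatch; PhaseTimer and Measure are hypothetical helper names for this example, and this is only a rough stand-in for a real profiler.

```csharp
using System;
using System.Diagnostics;

public static class PhaseTimer
{
    // Runs the given action once and returns the wall-clock time it took.
    public static TimeSpan Measure(Action action)
    {
        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

You would wrap the data loading in one Measure call and the statistics calculation in another. If nearly all of the time goes to the loading phase, adding threads to the calculation is unlikely to help much.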

solution1
1 2014-04-13 09:49:53
