DataTable Select vs List<T> LINQ Performance

Question

I have an app that does an SQL and loads a set of data into a datatable. As part of the processing there are 6 or 7 DataTable.Select() to filter some data. Each item that needs processing takes 300ms. There are 5000 items to process so takes 25 mins. This is unacceptable.

Would creating POCO's and loading them into a List and then using LINQ to query the list be quicker than using DataTable.Select?

Thanks

UPDATE: I have delved in a bit more and there are 2 datatables each with around 15000 records. The 2 queries used to populate the datatables take a second each. It then takes 25mins to loop over 5000 items in a Dictionary's values property and do 5 DataTable.Select's

eg/

foreach (OutputRecord Mailpiece in DictionaryMailpieces.Values)
{
    try
    {
        DataRow[] R = DataTable1.Select("MAILPIECE = " + Mailpiece.MailpieceSetSequenceNumber + " AND (STATUS = 4034 OR STATUS = 4037)", "DAL_DATE desc");
        if (R != null && R.Length > 0)
        {
        }
    }
    catch
    {
    }
}

Answer 1

Funny there is no "SQL" tag associated with your question. I suggest, you learn how to use the SQL language and its benefits. From what you say, it is likely you are, with your code, creating a lot of Cartesian products , instead of leveraging the Relational Database facilities (joins, indexes, etc.)

Using cross joins of DataTables or Lists or anything similar will always lead to heavy performance degradation, whatever language or platform is used.

That said, you could use LINQ because it's capable of producing smart SQL (dynamically), but you still want to avoid all ToList(), ToArray() and similar extension methods on IEnumerable(T) that summon all the underlying data (keep it enumerable from end to end and leverage "object streaming" whenever possible). If you understand really what is a Relational Database and how to use it efficiently, you will be a better LINQ developer.

Answer 2

Almost anything will be faster than manipulating an ADO.NET DataTable -- they are not designed for fast retrieval in any sense. You should also put the objects into an appropriate data structure; a DataTable is a red-black binary tree of rows, so if you don't want that, you shouldn't use one.

If you're just using the DataTable as a sequential collection of rows with fields, then you'll probably see a factor of 2 or more speedup just by replacing the DataTable with a List<T> and replacing your Select calls with Where calls, although it depends on what you're doing with it.

EDIT: Actually, I changed my mind. Nothing you could be doing sorting-or-filtering-wise with 5000 items in a DataTable implies a cost of anywhere close to 300ms, so the bottleneck is probably unrelated.

Answer 3

Would creating POCO's and loading them into a List and then using LINQ to query the list be quicker than using DataTable.Select?

We have no idea, you didn't give us enough information. We have no idea how your method is coded (maybe you have an errant Thread.Sleep(300) buried in your code; we can't tell).

More importantly, we need to know where the bottleneck is. To figure that out, you need a profiler. Get one and then once you know what the bottleneck is, we can probably help you eek out some extra performance.

That said, switching to LINQ probably isn't going to single-handedly be the solution to your performance problems. Something else is wrong, and whether it is coded using DataTable s and LINQ is mostly irrelevant. The performance gains are going to come from having the right plan of attack to your problem; DataTable s and LINQ are just ways of implementing that plan of attack.

Answer 4

Using LINQ will most likely not provide a huge speed improvement, in and of itself. That being said, you could potentially use PLINQ to simplify the parallelization of the processing, which could allow this to scale better on multicore systems. This tends to be much simpler when using POCOs instead of DataTable, as DataTable is not thread safe, and has concurrency issues.

That being said - I suspect that profiling this process would, overall, give you a much better potential improvement, as it would allow you to find and correct any bottlenecks. If there are no specific bottlenecks, and the process just requires that amount of raw processing, caching may also help. In addition, it's possible that leaving the data on the database and using some form of ORM may help, as well, as the "6 or 7" filter operations could be run on a scalable server instead of locally. All of this is highly dependent on the nature of your data and algorithm, however, so it would require some careful consideration to determine whether it would be beneficial or detrimental overall.

DataTable Select vs List<T> LINQ Performance

Question

4 answers

solution1
5 ACCPTED 2011-05-11 19:48:00

solution2
2 2011-05-11 19:34:49

solution3
0 2011-05-11 19:30:38

solution4
0 2011-05-11 19:31:18

DataTable Select vs List<T> LINQ Performance

Question

4 answers

solution1 5 ACCPTED 2011-05-11 19:48:00

solution2 2 2011-05-11 19:34:49

solution3 0 2011-05-11 19:30:38

solution4 0 2011-05-11 19:31:18

solution1
5 ACCPTED 2011-05-11 19:48:00

solution2
2 2011-05-11 19:34:49

solution3
0 2011-05-11 19:30:38

solution4
0 2011-05-11 19:31:18