
LINQ C# efficiency

I need to write a query pulling distinct values from columns defined by a user for any given data set. There could be millions of rows so the statements must be as efficient as possible. Below is the code I have.

What is the order of this LINQ query? Is there a more efficient way of doing this?

var MyValues = from r in MyDataTable.AsEnumerable()
               orderby r.Field<double>(_varName)
               select r.Field<double>(_varName); 

IEnumerable result = MyValues.Distinct();

I can't speak much to the AsEnumerable() call or the field conversions, but for the LINQ side of things, the orderby is a stable quicksort, so it should be O(n log n). If I had to guess, everything but the orderby should be O(n), so overall you're still looking at O(n log n).

Update: the LINQ Distinct() call should also be O(n) .
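The reason Distinct() is linear is that it streams each element through a hash set, which costs amortized O(1) per element. A minimal sketch of the idea (not the real BCL implementation, just an illustration):

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch of why Distinct() is O(n): each element costs one
// amortized-O(1) hash-set probe. Not the actual BCL code, just the same idea.
static class DistinctSketch
{
    public static IEnumerable<T> MyDistinct<T>(this IEnumerable<T> source)
    {
        var seen = new HashSet<T>();
        foreach (var item in source)
            if (seen.Add(item))   // Add returns false for duplicates
                yield return item;
    }
}

class Demo
{
    static void Main()
    {
        var values = new[] { 3.0, 1.0, 3.0, 2.0, 1.0 };
        Console.WriteLine(string.Join(", ", values.MyDistinct()));
        // prints "3, 1, 2" (first occurrence order is preserved)
    }
}
```

Note that, like the real Distinct(), this preserves the order in which elements first appear, which is why calling it after orderby keeps the sorted order.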

So altogether, the Big-O for this thing is still O(n log n); the constant factors from the other linear passes don't change that.

Is there a more efficient way of doing this?

You could get better efficiency if you do the sort as part of the query that initializes MyDataTable , instead of sorting in memory afterwards.

From the comments:

I actually use MyDistinct.Distinct()

If you want distinct _varName values and you cannot do it all in the select query in the DBMS (which would be the most efficient way), you should use Distinct before OrderBy. The order matters here.

Otherwise you would order all million rows before you start filtering out the duplicates. If you apply Distinct first, you only need to order what remains.

var values = from r in MyDataTable.AsEnumerable()
             select r.Field<double>(_varName);
IEnumerable<double> orderedDistinctValues = values.Distinct()
                                                  .OrderBy(d => d);

I recently asked a related question which Eric Lippert answered with a good explanation of when the order matters and when it doesn't:

Order of LINQ extension methods does not affect performance?

Here's a little demo where you can see that the order matters, but also that it hardly matters in practice, since comparing doubles is trivial for a CPU:

Time for first orderby then distinct: 00:00:00.0045379
Time for first distinct then orderby: 00:00:00.0013316
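The demo code itself isn't quoted above; a minimal reconstruction could look like the following (the input data and its size are my assumption, and absolute times will differ on your machine):

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// Hedged reconstruction of the timing demo whose output is quoted above.
// The data set is made up; absolute times will differ per machine.
class OrderVsDistinctDemo
{
    static void Main()
    {
        var rnd = new Random(42);
        // A million doubles with lots of duplicates (rounded to 2 decimals).
        double[] data = Enumerable.Range(0, 1_000_000)
                                  .Select(_ => Math.Round(rnd.NextDouble() * 100, 2))
                                  .ToArray();

        var sw = Stopwatch.StartNew();
        double[] a = data.OrderBy(d => d).Distinct().ToArray();
        sw.Stop();
        Console.WriteLine($"Time for first orderby then distinct: {sw.Elapsed}");

        sw.Restart();
        double[] b = data.Distinct().OrderBy(d => d).ToArray();
        sw.Stop();
        Console.WriteLine($"Time for first distinct then orderby: {sw.Elapsed}");

        // Both orders yield the same sorted, distinct values.
        Console.WriteLine(a.SequenceEqual(b));
    }
}
```

Distinct-first wins here because the sort then only has to handle the surviving unique values instead of all million rows.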

Your query above (LINQ) is fine if you want all the million records and you have enough memory on a 64-bit OS.

If you look at the underlying command, the query would be translated to

SELECT <_varName> FROM MyDataTable ORDER BY <_varName>

which is about as good as running it directly in the database IDE or on the command line.

To give you a short answer regarding performance:

  1. Put in a where clause if you can (on columns that are indexed).
  2. Ensure that the user can only choose columns (_varName) that are indexed. Imagine the DB trying to sort a million records on an unindexed column: that is evidently slow, but it is LINQ that ends up getting the bad press for it.
  3. Ensure that (if possible) the initialisation of MyDataTable is done with only the records that are of value (again, based on a where clause).
  4. Profile your underlying query.
  5. If possible, create stored procedures (debatable). You can create an entity model which includes stored procedures as well.
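Points 1 and 2 can be combined by building the SQL in one place and validating the user-chosen column against a whitelist of indexed columns; this also blocks SQL injection through the column name. A sketch, where the table and column names are placeholders of my own:

```csharp
using System;

// Sketch of points 1-2 above: the user-chosen column and the filter column
// are both checked against a whitelist of known indexed columns before any
// SQL is built. Table and column names here are hypothetical placeholders.
static class QueryBuilder
{
    static readonly string[] IndexedColumns = { "Price", "Quantity", "Region" };

    public static string BuildDistinctQuery(string column, string filterColumn)
    {
        // Rejecting unknown names keeps the sort on an indexed column and
        // prevents SQL injection via the user-supplied column name.
        if (Array.IndexOf(IndexedColumns, column) < 0 ||
            Array.IndexOf(IndexedColumns, filterColumn) < 0)
            throw new ArgumentException("column is not in the indexed whitelist");

        return $"SELECT DISTINCT [{column}] FROM MyDataTable " +
               $"WHERE [{filterColumn}] = @filter ORDER BY [{column}]";
    }
}

class Demo
{
    static void Main()
    {
        Console.WriteLine(QueryBuilder.BuildDistinctQuery("Price", "Region"));
        // SELECT DISTINCT [Price] FROM MyDataTable WHERE [Region] = @filter ORDER BY [Price]
    }
}
```

The @filter value itself should then be passed as a command parameter, never concatenated into the string.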

It may be fast enough today, but as the tablespace grows, and if your data is not indexed, that is where things get slower (even with a good LINQ expression).

Hope this helps

That said, none of the above helps much if your DB is not properly indexed.
