
What's the best way to sort about 2.5 million records in memory in C#?

Suppose I have a class:

class Employee
{
    public string Id { get; set; }
    public string Type { get; set; }
    public string Identifier { get; set; }
    public object Resume { get; set; }
    public DateTime StartDate { get; set; }
    public DateTime EndDate { get; set; }
}
List<Employee> employees = LoadEmployees(); // Around 2.5 to 3 million employees
employees = employees
                .Where(x => x.Identifier != null)
                .OrderBy(x => x.Identifier)
                .ToList(); // ToList, not ToArray, because employees is a List<Employee>

I have a requirement to load and sort around 2.5 million employees in memory, but the LINQ query gets stuck on the OrderBy clause. Any pointers on this? I created this Employee class just to simplify my problem.

I would apply the .Where(x => x.Identifier != null) clause first, as you already do, since it filters out some data before the OrderBy has to sort it. Given that you have only ~2.5 million records and that they hold only basic types like string and DateTime, you should not have any memory problems in this case.

Edit:

I have just run your code as a sample, and it does finish in a matter of seconds (a little over 15 seconds on my machine, which does not have a very powerful CPU), so it does not get stuck:

List<Employee> employees = new List<Employee>();
for(int i=0;i<2500000;i++)
{
    employees.Add(new Employee
    {
        Id = Guid.NewGuid().ToString(),
        Identifier = Guid.NewGuid().ToString(),
        Type = i.ToString(),
        StartDate = DateTime.MinValue,
        EndDate = DateTime.Now
    });
}

var newEmployees = employees
    .Where(x => x.Identifier != null)
    .OrderBy(x => x.Identifier)
    .ToArray();

As a second edit, I have just run some tests, and it seems that an implementation using Parallel LINQ (PLINQ) can in some cases be about 1.5 seconds faster than the serial implementation:

var newEmployees1 = employees.AsParallel()
    .Where(x => x.Identifier != null)
    .OrderBy(x => x.Identifier)
    .ToArray();

And these are the best timings that I got (in milliseconds):

7599 //serial implementation
5752 //parallel linq

But the parallel results can vary from one machine to another, so I suggest running some tests yourself. If you still have a problem after that, then maybe edit the question or post another one.
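For anyone who wants to repeat the comparison, here is a minimal sketch of how such timings could be taken with Stopwatch. The TimeMs helper and the SortTiming class are hypothetical names introduced only for this illustration, the Employee class is cut down to the one property being sorted, and a single run per variant is only a rough measurement, not a proper benchmark.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

class Employee
{
    public string Identifier { get; set; }
}

static class SortTiming
{
    // Hypothetical helper: runs the action once and returns the elapsed milliseconds.
    public static long TimeMs(Action action)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        // Same kind of test data as above: random GUID strings as identifiers.
        var employees = Enumerable.Range(0, 2500000)
            .Select(_ => new Employee { Identifier = Guid.NewGuid().ToString() })
            .ToList();

        // ToArray() forces the deferred query to execute inside the timed region.
        long serialMs = TimeMs(() => employees
            .Where(x => x.Identifier != null)
            .OrderBy(x => x.Identifier)
            .ToArray());

        long parallelMs = TimeMs(() => employees.AsParallel()
            .Where(x => x.Identifier != null)
            .OrderBy(x => x.Identifier)
            .ToArray());

        Console.WriteLine($"serial:   {serialMs} ms");
        Console.WriteLine($"parallel: {parallelMs} ms");
    }
}
```

Running each variant several times and taking the best or median value should give steadier numbers, since the first run also pays for JIT compilation.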

Using the hint that @Igor proposed in the comment below, the parallel implementation with StringComparer.OrdinalIgnoreCase is about three times faster than the simple parallel implementation. The final (fastest) code looks like this:

// A new variable name is needed: "var employees = employees..." does not compile.
var sortedEmployees = employees.AsParallel()
    .Where(x => x.Identifier != null)
    .OrderBy(x => x.Identifier, StringComparer.OrdinalIgnoreCase)
    .ToArray();
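One thing to keep in mind is that passing a comparer changes the resulting order, not only the speed. Without a comparer, OrderBy on strings uses the default culture-sensitive comparison, while the ordinal comparers compare character codes, which is probably where the speedup comes from. The small sketch below shows how the two ordinal comparers order the same keys; ComparerDemo and SortWith are hypothetical names used only for this illustration.

```csharp
using System;
using System.Linq;

static class ComparerDemo
{
    // Hypothetical helper: sorts the keys with the given comparer.
    public static string[] SortWith(string[] keys, StringComparer comparer) =>
        keys.OrderBy(k => k, comparer).ToArray();

    static void Main()
    {
        var keys = new[] { "b", "A", "a", "B" };

        // Ordinal compares raw character codes, so uppercase sorts before lowercase.
        Console.WriteLine(string.Join(",", SortWith(keys, StringComparer.Ordinal)));
        // prints "A,B,a,b"

        // OrdinalIgnoreCase treats "A" and "a" as equal; the sequential OrderBy is
        // stable, so equal keys keep their original relative order.
        Console.WriteLine(string.Join(",", SortWith(keys, StringComparer.OrdinalIgnoreCase)));
        // prints "A,a,b,B"
    }
}
```

If the identifiers must appear in a culture-aware order for display, it is worth checking that the ordinal result is acceptable before switching comparers.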
