What's the best way to sort about 2.5 million records in memory in C#?
Consider that I have a class:
class Employee
{
    public string Id { get; set; }
    public string Type { get; set; }
    public string Identifier { get; set; }
    public object Resume { get; set; }
    public DateTime StartDate { get; set; }
    public DateTime EndDate { get; set; }
}
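List<Employee> employees = LoadEmployees(); // Around 2.5 to 3 million employees
// ToArray() returns Employee[], which cannot be assigned back to the List<Employee> variable.
Employee[] sortedEmployees = employees
    .Where(x => x.Identifier != null)
    .OrderBy(x => x.Identifier)
    .ToArray();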
I have a requirement to load and sort around 2.5 million employees in memory, but the LINQ query gets stuck on the OrderBy clause. Any pointers on this? I created this Employee class just to simplify my problem.
I would apply the .Where(x => x.Identifier != null) clause first, since it filters out some of the data, and only then do the OrderBy. Given that you have only ~2.5 million records and that they hold only basic types like string and DateTime, you should not have any problems with memory in this case.
Edit:
I have just run your code as a sample, and it does finish in a matter of seconds (over 15 seconds on my machine, which does not have a very powerful CPU), so it does not get stuck:
List<Employee> employees = new List<Employee>();
for (int i = 0; i < 2500000; i++)
{
    employees.Add(new Employee
    {
        Id = Guid.NewGuid().ToString(),
        Identifier = Guid.NewGuid().ToString(),
        Type = i.ToString(),
        StartDate = DateTime.MinValue,
        EndDate = DateTime.Now
    });
}
var newEmployees = employees
    .Where(x => x.Identifier != null)
    .OrderBy(x => x.Identifier)
    .ToArray();
As a second edit: I have just run some tests, and it seems that an implementation using Parallel LINQ can in some cases be about 1.5 seconds faster than the serial implementation:
var newEmployees1 = employees.AsParallel()
    .Where(x => x.Identifier != null)
    .OrderBy(x => x.Identifier)
    .ToArray();
And these are the best numbers that I got:
7599 //serial implementation
5752 //parallel linq
But the parallel results can vary from one machine to another, so I suggest running some tests yourself; if you still find a problem, then maybe edit the question or post another one.
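For anyone who wants to repeat the comparison on their own machine, here is a minimal sketch of how such timings could be taken with System.Diagnostics.Stopwatch. The names SortTiming, MakeIdentifiers and TimeMs are made up for this illustration, and plain GUID strings stand in for the Identifier values instead of full Employee objects. A single run like this is only a rough indication, not a proper benchmark.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static class SortTiming
{
    // Hypothetical helper: builds n random GUID strings to stand in for Identifier values.
    public static List<string> MakeIdentifiers(int n) =>
        Enumerable.Range(0, n).Select(_ => Guid.NewGuid().ToString()).ToList();

    // Hypothetical helper: runs the action once and returns elapsed wall-clock milliseconds.
    public static long TimeMs(Action action)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        var ids = MakeIdentifiers(2_500_000);

        string[] serial = null, parallel = null;
        long serialMs = TimeMs(() => serial = ids.OrderBy(x => x).ToArray());
        long parallelMs = TimeMs(() => parallel = ids.AsParallel().OrderBy(x => x).ToArray());

        Console.WriteLine($"serial:   {serialMs} ms");
        Console.WriteLine($"parallel: {parallelMs} ms");
        // The keys are unique, so both pipelines should yield the same ordering.
        Console.WriteLine($"same result: {serial.SequenceEqual(parallel)}");
    }
}
```

Running each variant several times and discarding the first (JIT warm-up) run usually gives steadier numbers.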
Using the hint that @Igor proposed in the comment below, the parallel implementation with StringComparer.OrdinalIgnoreCase is about three times faster than the simple parallel implementation. The final (fastest) code looks like this:
// Use a new variable name: "var employees = employees..." would not compile.
var sortedEmployees = employees.AsParallel()
    .Where(x => x.Identifier != null)
    .OrderBy(x => x.Identifier, StringComparer.OrdinalIgnoreCase)
    .ToArray();
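One thing to keep in mind is that switching the comparer can also change the resulting order, not just the speed. The small sketch below (ComparerDemo and SortWith are names invented for this example) contrasts StringComparer.Ordinal with StringComparer.OrdinalIgnoreCase on four short strings; whether either ordering is acceptable depends on what the Identifier values mean in your data.

```csharp
using System;
using System.Linq;

static class ComparerDemo
{
    // Hypothetical helper: sorts with the given comparer and joins the result for display.
    public static string SortWith(string[] items, StringComparer comparer) =>
        string.Join(",", items.OrderBy(x => x, comparer));

    public static void Main()
    {
        var items = new[] { "b", "A", "a", "B" };

        // Ordinal compares raw UTF-16 code units, so uppercase sorts before lowercase.
        Console.WriteLine(SortWith(items, StringComparer.Ordinal));           // A,B,a,b

        // OrdinalIgnoreCase treats "A" and "a" as equal; Enumerable.OrderBy is stable,
        // so equal keys keep their input order.
        Console.WriteLine(SortWith(items, StringComparer.OrdinalIgnoreCase)); // A,a,b,B
    }
}
```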