How to optimize a simple LINQ query with two related tables?

Question

I have two related tables with the following structure:

'patients':

{ Id = 1, Surname = Smith998 }
...
{ Id = 1000, Surname = Smith1000 }

and the second is 'receptions':

{ PatientId = 1, ReceptionStart = 3/3/2017 1:14:00 AM }
{ PatientId = 1, ReceptionStart = 1/7/2016 1:14:00 AM }
...
{ PatientId = 1000, ReceptionStart = 1/23/2017 1:14:00 AM }

The tables are not from a database; they are generated with the following sample code:

        var rand = new Random();
        var receptions = Enumerable.Range(1, 1000).SelectMany(pid => Enumerable.Range(1, rand.Next(0, 10)).Select(rid => new { PatientId = pid, ReceptionStart = DateTime.Now.AddDays(-rand.Next(1, 500)) })).ToList();
        var patients = Enumerable.Range(1, 1000).Select(pid => new { Id = pid, Surname = string.Format("Smith{0}", pid) }).ToList();

The question is: what is the optimal way to select the patients that have receptions before 1/1/2017?

Of course, I can write something like this:

        var cured_receptions = (from r in receptions where r.ReceptionStart < new DateTime(2017, 7, 1) select r.PatientId).Distinct();
        var cured_patients = from p in patients where cured_receptions.Contains(p.Id) select p;

But it is not clear to me what the 'cured_receptions.Contains(p.Id)' code actually does. Does it simply iterate over all the patients searching for the Id, or does it use something like indices in a database? Can cured_receptions.ToDictionary() or something like it help here somehow?

Answer 1

but it is not clear for me what 'cured_receptions.Contains(p.Id)' code actually does? Does it simply iterate over all the patients searching the Id or it use something like indices in a database?

Case 1: Interacting with a database

If you were interacting with a database, no query would be sent to it until you execute the 2nd query, either by calling ToList() on it or by iterating the items in cured_patients. The query sent to the database will be something along these lines:

SELECT 
[Extent1].[Id] AS [Id], 
[Extent1].[Surname] AS [Surname]
FROM [dbo].[Patients] AS [Extent1]
WHERE  EXISTS (SELECT 
    1 AS [C1]
    FROM [dbo].[Receptions] AS [Extent2]
    WHERE ([Extent2].[ReceptionStart] < 
    convert(datetime2, '2017-07-01 00:00:00.0000000', 121)) 
    AND ([Extent2].[PatientId] = [Extent1].[Id])
)

Will it use any indices?

Yes, if PatientId, Id and ReceptionStart are indexed, then the database server can use them.

Case 2: Interacting with items in memory

The first query will iterate over all receptions, find the ones whose ReceptionStart is before the given date, select the PatientId and then remove any duplicate PatientId(s).

Then comes the 2nd query, shown below:

var cured_patients = 
   from p in patients 
   where cured_receptions.Contains(p.Id) 
   select p;

This query iterates over each item in patients and checks whether the Id of that item is found in cured_receptions. It selects all the items whose Id is found there. Contains simply returns true or false.
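To make the in-memory behaviour concrete, here is a minimal sketch that counts how often the date filter runs. The tiny data set and the predicateCalls counter are made up for illustration. Because cured_receptions is a deferred query and not a collection, each Contains call enumerates it again from the start and scans linearly until it finds a match or runs out of items; nothing like a database index is involved.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical miniature data set: 3 patients, 4 receptions.
var receptions = new[]
{
    new { PatientId = 1, ReceptionStart = new DateTime(2016, 1, 7) },
    new { PatientId = 1, ReceptionStart = new DateTime(2017, 3, 3) },
    new { PatientId = 2, ReceptionStart = new DateTime(2017, 8, 1) },
    new { PatientId = 3, ReceptionStart = new DateTime(2017, 1, 23) },
};
var patients = new[]
{
    new { Id = 1, Surname = "Smith1" },
    new { Id = 2, Surname = "Smith2" },
    new { Id = 3, Surname = "Smith3" },
};

var cutoff = new DateTime(2017, 7, 1);
int predicateCalls = 0; // counts how often the date filter is evaluated

// Deferred query: nothing is executed on this line.
var cured_receptions = receptions
    .Where(r => { predicateCalls++; return r.ReceptionStart < cutoff; })
    .Select(r => r.PatientId)
    .Distinct();

// Each Contains call enumerates cured_receptions again.
var deferredResult = patients.Where(p => cured_receptions.Contains(p.Id)).ToList();
int deferredCalls = predicateCalls;

// Materializing first runs the filter once per reception, then reuses the list.
predicateCalls = 0;
var materialized = cured_receptions.ToList();
var materializedResult = patients.Where(p => materialized.Contains(p.Id)).ToList();
int materializedCalls = predicateCalls;

Console.WriteLine($"deferred: {deferredCalls} filter calls, materialized: {materializedCalls}");
```

Both variants select the same patients; the deferred one simply repeats the filtering work for every patient it checks.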

Answer 2

Starting over, assuming everything is in memory only...

Your cured_receptions isn't evaluated until it is called by Contains, so it is much more efficient to put .ToList() on the end of that variable definition (about 100X faster).
LINQ doesn't "search"; Contains is doing the searching. If you want to use something like a binary search or, better still, a hash table, you must create it. If you use a HashSet<int>, you gain another 47X speedup. Taking the Distinct off (since the HashSet will handle that) saves another 15%.
Storing your constants in variables instead of creating them as you go (new DateTime ...) may save a little more. With the HashSet, even greatly increasing your random data doesn't take up enough time to measure.
Using a join is faster than your initial query, but your query combined with a HashSet is fastest.

So the fastest code is:

var cured_receptions = new HashSet<int>((from r in receptions where r.ReceptionStart < endDateTime select r.PatientId));
var cured_patients = from p in patients where cured_receptions.Contains(p.Id) select p;
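On the ToDictionary() idea from the question: a dictionary requires unique keys, so calling ToDictionary on the raw PatientId sequence throws as soon as a patient has two matching receptions, whereas a HashSet<int> offers the same hashed lookup and quietly drops duplicates. The following is a small sketch with made-up data. It assumes a runtime where the ToHashSet() extension is available (.NET Core 2.0 / .NET Framework 4.7.2 or later); on older runtimes the new HashSet<int>(...) constructor shown above does the same job.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up sample: patient 1 has two receptions before the cutoff.
var receptions = new[]
{
    new { PatientId = 1, ReceptionStart = new DateTime(2016, 1, 7) },
    new { PatientId = 1, ReceptionStart = new DateTime(2017, 3, 3) },
    new { PatientId = 2, ReceptionStart = new DateTime(2017, 8, 1) },
};
var endDateTime = new DateTime(2017, 7, 1);

var ids = receptions.Where(r => r.ReceptionStart < endDateTime)
                    .Select(r => r.PatientId);

// HashSet: duplicate ids are ignored and Contains is a hashed lookup.
HashSet<int> idSet = ids.ToHashSet();

// ToDictionary on the same sequence fails because key 1 appears twice.
bool dictionaryThrew = false;
try
{
    ids.ToDictionary(id => id);
}
catch (ArgumentException)
{
    dictionaryThrew = true;
}

Console.WriteLine($"set size: {idSet.Count}, ToDictionary threw: {dictionaryThrew}");
```

A dictionary would only be worth the extra step if some value had to be stored per patient; for a pure membership test the set is enough.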

Note: I used LINQPad to generate timings and sample data. I changed your date parameters because your values made the majority of receptions match.

Here is the code from my LINQPad:

var rand = new Random();
var begin = DateTime.Now;
var receptions = Enumerable.Range(1, 100000).SelectMany(pid => Enumerable.Range(1, rand.Next(0, 100)).Select(rid => new { PatientId = pid, ReceptionStart = begin.AddDays(-rand.Next(1, 180)) })).ToList();
var patients = Enumerable.Range(1, 100000).Select(pid => new { Id = pid, Surname = string.Format("Smith{0}", pid) }).ToList();

var startTime = Util.ElapsedTime;
var endDateTime = new DateTime(2017, 5, 1);
//var cured_receptions = (from r in receptions where r.ReceptionStart < new DateTime(2017, 5, 1) select r.PatientId).Distinct().ToList();
//var cured_receptions = (from r in receptions where r.ReceptionStart < new DateTime(2017, 5, 1) select r.PatientId).Distinct();
//var cured_receptions = new HashSet<int>((from r in receptions where r.ReceptionStart < new DateTime(2017, 5, 1) select r.PatientId).Distinct());
//var cured_receptions = new HashSet<int>((from r in receptions where r.ReceptionStart < endDateTime select r.PatientId).Distinct());
//var cured_receptions = new HashSet<int>((from r in receptions where r.ReceptionStart < new DateTime(2017, 5, 1) select r.PatientId));
var cured_receptions = new HashSet<int>((from r in receptions where r.ReceptionStart < endDateTime select r.PatientId));
var cured_patients = from p in patients where cured_receptions.Contains(p.Id) select p;

//  var cured_patients = (from r in receptions
//                       where r.ReceptionStart < endDateTime
//                       join p in patients on r.PatientId equals p.Id
//                       select p).Distinct();

//  var cured_patients = from p in patients
//                       join r in receptions on p.Id equals r.PatientId into rj
//                       where rj.Any(r => r.ReceptionStart < endDateTime)
//                       select p;

cured_patients.Count().Dump();
var endTime = Util.ElapsedTime;

(endTime - startTime).Dump("Elapsed");
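Util.ElapsedTime and Dump() are LINQPad helpers. Outside LINQPad, a rough equivalent can be sketched with System.Diagnostics.Stopwatch. In the sketch below, the Time helper is a hypothetical name, the data set is smaller than the one above, the random seed is fixed, and the cutoff is taken relative to begin instead of a fixed calendar date. Timings depend on machine and runtime, so the printed numbers are only indicative.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

var rand = new Random(42); // fixed seed so runs are repeatable
var begin = DateTime.Now;
var receptions = Enumerable.Range(1, 2000)
    .SelectMany(pid => Enumerable.Range(1, rand.Next(0, 10))
        .Select(rid => new { PatientId = pid, ReceptionStart = begin.AddDays(-rand.Next(1, 180)) }))
    .ToList();
var patients = Enumerable.Range(1, 2000)
    .Select(pid => new { Id = pid, Surname = string.Format("Smith{0}", pid) })
    .ToList();
var endDateTime = begin.AddDays(-90);

// Hypothetical helper: runs one variant and reports its count and elapsed time.
int Time(string label, Func<int> run)
{
    var sw = Stopwatch.StartNew();
    int count = run();
    sw.Stop();
    Console.WriteLine($"{label}: {count} patients in {sw.Elapsed.TotalMilliseconds:F1} ms");
    return count;
}

// Variant 1: materialized list, linear Contains.
int listCount = Time("List + Contains", () =>
{
    var ids = (from r in receptions where r.ReceptionStart < endDateTime select r.PatientId).Distinct().ToList();
    return patients.Count(p => ids.Contains(p.Id));
});

// Variant 2: hash set, constant-time Contains on average.
int setCount = Time("HashSet + Contains", () =>
{
    var ids = new HashSet<int>(from r in receptions where r.ReceptionStart < endDateTime select r.PatientId);
    return patients.Count(p => ids.Contains(p.Id));
});

// Variant 3: group join, as in the commented-out alternative above.
int joinCount = Time("GroupJoin + Any", () =>
    (from p in patients
     join r in receptions on p.Id equals r.PatientId into rj
     where rj.Any(r => r.ReceptionStart < endDateTime)
     select p).Count());
```

All three variants should report the same patient count; only the elapsed times differ. A single run like this includes JIT warm-up, so repeating each variant a few times gives steadier numbers.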

2 answers

Answer 1
0 2017-07-26 23:04:45

Answer 2
0 Accepted 2017-07-26 23:16:22
