LINQ - filtering, grouping and getting Min and Max value

Let's say that I have an EF entity class that represents some value in time:

public class Point
{
    public DateTime DT {get; set;}
    public decimal Value {get; set;}
}

I also have a class that represents a time period:

public class Period
{
    public DateTime Begin {get; set;}
    public DateTime End {get; set;}
}

Then I have an array of Period objects that contains specific time slices. Let's say it looks like this (the Period objects are always in ascending order in the array):

var periodSlices = new Period [] 
{
    new Period { Begin = new DateTime(2016, 10, 1), End = new DateTime(2016, 10, 15)},
    new Period { Begin = new DateTime(2016, 10, 16), End = new DateTime(2016, 10, 20)},
    new Period { Begin = new DateTime(2016, 10, 21), End = new DateTime(2016, 12, 30)}
};

Now, using LINQ to SQL, how do I write a query that filters the Point rows and groups them so that I get the oldest (min) and latest (max) values within each of the periodSlices? In this example scenario the result should have 3 groups, each with a min and a max point (if there are any, of course).

So what I need as a result is something like IQueryable<Period, IEnumerable<Point>>.

Right now I am doing it this way, but the performance is not great:

using (var context = new EfDbContext())
{
    var periodBegin = periodSlices[0].Begin;
    var periodEnd = periodSlices[periodSlices.Length - 1].End;

    var dbPoints = context.Points.Where(p => p.DT >= periodBegin && p.DT <= periodEnd).ToArray();

    foreach (var slice in periodSlices)
    {
        var points = dbPoints.Where(p => p.DT >= slice.Begin && p.DT <= slice.End);

        if (points.Any())
        {
            var latestValue = points.MaxBy(u => u.DT).Value;
            var earliestValue = points.MinBy(u => u.DT).Value;
        }
    }   
}

Performance is crucial (the faster the better, as I need to filter and group ~100k points).

Here is a single SQL query solution:

var baseQueries = periodSlices
    .Select(slice => db.Points
        .Select(p => new { Period = new Period { Begin = slice.Begin, End = slice.End }, p.DT })
        .Where(p => p.DT >= p.Period.Begin && p.DT <= p.Period.End)
    );

var unionQuery = baseQueries
    .Aggregate(Queryable.Concat);

var periodQuery = unionQuery
    .GroupBy(p => p.Period)
    .Select(g => new
    {
        Period = g.Key,
        MinDT = g.Min(p => p.DT),
        MaxDT = g.Max(p => p.DT),
    });

var finalQuery =
    from p in periodQuery
    join pMin in db.Points on p.MinDT equals pMin.DT
    join pMax in db.Points on p.MaxDT equals pMax.DT
    select new
    {
        Period = p.Period,
        EarliestPoint = pMin,
        LatestPoint = pMax,
    };

I've separated the LINQ query parts into separate variables just for readability. To get the result, only the final query should be executed:

var result = finalQuery.ToList();

Basically we build a UNION ALL query with one branch per slice, then determine the minimum and maximum dates for each period, and finally get the corresponding values for these dates. I've used join instead of the "typical" OrderBy(Descending) + FirstOrDefault() inside the grouping because the latter generates terrible SQL.
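The three steps can be tried without a database. Below is a minimal LINQ to Objects sketch of the same shape (concatenate per-slice filters, group to get min/max dates, join back for the full points). The names SliceMinMax and Compute are hypothetical and not part of the answer above. It groups by slice index rather than by a projected Period, because in-memory grouping would compare newly created Period objects by reference. As with the join in the query above, if several points share the same DT the result contains one row per matching pair.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Point
{
    public DateTime DT { get; set; }
    public decimal Value { get; set; }
}

public class Period
{
    public DateTime Begin { get; set; }
    public DateTime End { get; set; }
}

// Hypothetical helper for illustration only.
public static class SliceMinMax
{
    public static List<Tuple<Period, Point, Point>> Compute(
        IReadOnlyList<Point> points, IReadOnlyList<Period> slices)
    {
        // Aggregate without a seed throws on an empty sequence, so guard first.
        if (slices.Count == 0)
            return new List<Tuple<Period, Point, Point>>();

        // Step 1: one filtered sequence per slice, tagged with the slice index,
        // concatenated into one sequence (the in-memory analogue of UNION ALL).
        var tagged = slices
            .Select((slice, i) => points
                .Where(p => p.DT >= slice.Begin && p.DT <= slice.End)
                .Select(p => new { Slice = i, p.DT }))
            .Aggregate((a, b) => a.Concat(b));

        // Step 2: earliest and latest timestamp per slice.
        var ranges = tagged
            .GroupBy(x => x.Slice)
            .Select(g => new
            {
                Slice = g.Key,
                MinDT = g.Min(x => x.DT),
                MaxDT = g.Max(x => x.DT),
            });

        // Step 3: join back to the points to fetch the rows for those timestamps.
        var query =
            from r in ranges
            join pMin in points on r.MinDT equals pMin.DT
            join pMax in points on r.MaxDT equals pMax.DT
            orderby r.Slice
            select Tuple.Create(slices[r.Slice], pMin, pMax);

        return query.ToList();
    }
}
```

Slices that contain no points produce no row, which matches the behaviour of the inner joins in the database query.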

Now, the main question. I can't say whether this would be faster than the original approach. It depends on whether the DT column is indexed and on the number of periodSlices, because each slice adds another UNION ALL SELECT from the source table to the query. For 3 slices the generated SQL looks like this:

SELECT
    [GroupBy1].[K1] AS [C1],
    [GroupBy1].[K2] AS [C2],
    [GroupBy1].[K3] AS [C3],
    [Extent4].[DT] AS [DT],
    [Extent4].[Value] AS [Value],
    [Extent5].[DT] AS [DT1],
    [Extent5].[Value] AS [Value1]
    FROM    (SELECT
        [UnionAll2].[C1] AS [K1],
        [UnionAll2].[C2] AS [K2],
        [UnionAll2].[C3] AS [K3],
        MIN([UnionAll2].[DT]) AS [A1],
        MAX([UnionAll2].[DT]) AS [A2]
        FROM  (SELECT
            1 AS [C1],
            @p__linq__0 AS [C2],
            @p__linq__1 AS [C3],
            [Extent1].[DT] AS [DT]
            FROM [dbo].[Point] AS [Extent1]
            WHERE ([Extent1].[DT] >= @p__linq__0) AND ([Extent1].[DT] <= @p__linq__1)
        UNION ALL
            SELECT
            1 AS [C1],
            @p__linq__2 AS [C2],
            @p__linq__3 AS [C3],
            [Extent2].[DT] AS [DT]
            FROM [dbo].[Point] AS [Extent2]
            WHERE ([Extent2].[DT] >= @p__linq__2) AND ([Extent2].[DT] <= @p__linq__3)
        UNION ALL
            SELECT
            1 AS [C1],
            @p__linq__4 AS [C2],
            @p__linq__5 AS [C3],
            [Extent3].[DT] AS [DT]
            FROM [dbo].[Point] AS [Extent3]
            WHERE ([Extent3].[DT] >= @p__linq__4) AND ([Extent3].[DT] <= @p__linq__5)) AS [UnionAll2]
        GROUP BY [UnionAll2].[C1], [UnionAll2].[C2], [UnionAll2].[C3] ) AS [GroupBy1]
    INNER JOIN [dbo].[Point] AS [Extent4] ON [GroupBy1].[A1] = [Extent4].[DT]
    INNER JOIN [dbo].[Point] AS [Extent5] ON [GroupBy1].[A2] = [Extent5].[DT]

If you want to get the earliest (min) and latest (max) point in each time slice, the first thing I would look at is getting the database to do more of the work.

When you call .ToArray() it brings all the selected points into memory. This is pointless, as you only want 2 per slice. So if you did something like:

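foreach (var slice in periodSlices)
{
    var q = context
                .Points
                .Where(p => p.DT >= slice.Begin && p.DT <= slice.End);

    // Each call below runs its own query and returns at most one row.
    var min = q.OrderBy(x => x.DT).FirstOrDefault();
    // LINQ to Entities (EF6) cannot translate LastOrDefault(),
    // so sort descending and take the first row instead.
    var max = q.OrderByDescending(x => x.DT).FirstOrDefault();
}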

It might work better.

I say might because it depends on what indexes there are on the database and how many points are in each slice. Ultimately, to get really good performance you may have to add an index on the datetime column, change the structure so the min and max are pre-stored, or do it in a stored procedure.
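As an illustration of the index suggestion, here is a minimal sketch that assumes Entity Framework 6.1 or later with Code First, where an index can be declared with the [Index] attribute and is created by migrations. The index name IX_Point_DT is hypothetical, and the question does not say which EF version or workflow is in use, so treat this as one possible way rather than the required one. It is a mapping fragment and needs the EntityFramework package to compile.

```csharp
using System;
using System.ComponentModel.DataAnnotations.Schema;

public class Point
{
    // Assumption: EF 6.1+ Code First. The index name is made up for this example.
    [Index("IX_Point_DT")]
    public DateTime DT { get; set; }

    public decimal Value { get; set; }
}
```

With a database-first model the index would instead be created directly on the table's DT column.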
