
Read huge table with LINQ to SQL: Running out of memory vs slow paging

I have a huge table which I need to read through in a certain order and compute some aggregate statistics. The table already has a clustered index for the correct order, so getting the records themselves is pretty fast. I'm trying to use LINQ to SQL to simplify the code that I need to write. The problem is that I don't want to load all the objects into memory, since the DataContext seems to keep them around -- yet trying to page them results in horrible performance problems.

Here's the breakdown. Original attempt was this:

var logs = 
    (from record in dataContext.someTable 
     where [index is appropriate]
     select record);

foreach (linqEntity l in logs)
{
    // Do stuff with data from l
}

This is pretty fast, and streams at a good rate, but the problem is that the memory use of the application keeps going up and never stops. My guess is that the LINQ to SQL entities are being kept around in memory and not being disposed properly. So after reading Out of memory when creating a lot of objects C#, I tried the following approach. This seems to be the common Skip/Take paradigm that many people use, with the added feature of saving memory.

Note that _conn is created beforehand, and a temporary data context is created for each query, resulting in the associated entities being garbage collected.

int skipAmount = 0;
bool finished = false;

while (!finished)
{
    // Trick to allow for automatic garbage collection while iterating through the DB
    using (var tempDataContext = new MyDataContext(_conn) {CommandTimeout = 600})
    {               
        var query =
            (from record in tempDataContext.someTable
             where [index is appropriate]
             select record);

        List<linqEntity> logs = query.Skip(skipAmount).Take(BatchSize).ToList();
        if (logs.Count == 0)
        {
            finished = true;
            continue;
        }

        foreach (linqEntity l in logs)
        {
            // Do stuff with data from l
        }

        skipAmount += logs.Count;
    }
}

Now I have the desired behavior that memory usage doesn't increase at all as I stream through the data. Yet I have a far worse problem: each Skip causes the data to load more and more slowly, because the underlying query seems to make the server go through all the data for all previous pages. While running the query, each page takes longer and longer to load, and I can tell that this is turning into a quadratic operation. This problem has come up in other posts as well.

I can't seem to find a way to do this with LINQ that lets me limit memory use by paging the data while still loading each page in constant time. Is there a way to do this properly? My hunch is that there might be some way to tell the DataContext to explicitly forget about the objects in the first approach above, but I can't find out how to do that.
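For reference, one common workaround for the quadratic Skip cost is keyset (or "seek") pagination: instead of skipping N rows, each batch filters on the clustered-index column itself, so the server can seek straight to the resume point. A minimal sketch, assuming a hypothetical integer `Id` column that the clustered index is built on:

```csharp
// Keyset ("seek") paging: each batch resumes from the last key seen,
// so the server seeks directly into the clustered index instead of
// re-scanning all previously read rows the way Skip() does.
int lastId = 0;            // assumes all Id values are > 0
bool finished = false;

while (!finished)
{
    using (var tempDataContext = new MyDataContext(_conn) { CommandTimeout = 600 })
    {
        List<linqEntity> logs =
            (from record in tempDataContext.someTable
             where record.Id > lastId          // seek past the previous batch
             orderby record.Id                 // must match the clustered index order
             select record)
            .Take(BatchSize)
            .ToList();

        if (logs.Count == 0)
        {
            finished = true;
            continue;
        }

        foreach (linqEntity l in logs)
        {
            // Do stuff with data from l
        }

        lastId = logs[logs.Count - 1].Id;      // remember where to resume
    }
}
```

This keeps the per-context memory behavior of the Skip/Take version, but each query is a constant-time index seek rather than a scan whose cost grows with the page number. The `Id` column name is an assumption; any column (or column combination) that matches the clustered index and imposes a total order will do.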

After madly grasping at straws, I found that the DataContext's ObjectTrackingEnabled = false could be just what the doctor ordered. It is, not surprisingly, specifically designed for a read-only case like this.

using (var readOnlyDataContext = 
    new MyDataContext(_conn) {CommandTimeout = really_long, ObjectTrackingEnabled = false})
{                                                 
    var logs =
        (from record in readOnlyDataContext.someTable
         where [index is appropriate]
         select record);

    foreach (linqEntity l in logs)
    {
        // Do stuff with data from l   
    }                
}

The above approach keeps memory usage flat while streaming through the objects. When writing data, I can use a different DataContext that has object tracking enabled, and that seems to work okay. However, this approach still runs a single SQL query that can take an hour or more to stream and complete, so if there's a way to do the paging as above without the performance hit, I'm open to other alternatives.
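The two ideas can also be combined: short-lived read-only contexts plus keyset batches, so that no single query runs for an hour and no context accumulates tracked entities. A sketch, again assuming a hypothetical clustered-index column `Id`:

```csharp
int lastId = 0;
List<linqEntity> batch;

do
{
    // A fresh, read-only context per batch: ObjectTrackingEnabled = false
    // means nothing is cached for change tracking, and the keyed Where
    // clause lets each query seek to where the previous one stopped.
    using (var ctx = new MyDataContext(_conn)
        { CommandTimeout = 600, ObjectTrackingEnabled = false })
    {
        batch = ctx.someTable
            .Where(r => r.Id > lastId)   // hypothetical clustered-index column
            .OrderBy(r => r.Id)
            .Take(BatchSize)
            .ToList();

        foreach (linqEntity l in batch)
        {
            // Do stuff with data from l
        }
    }

    if (batch.Count > 0)
        lastId = batch[batch.Count - 1].Id;
} while (batch.Count > 0);
```

Each batch is a short, constant-time query, so a failure partway through loses at most one batch of work rather than an hour-long streaming read.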

A warning about turning object tracking off: I found that when you try to do multiple concurrent reads with the same DataContext, you don't get the error There is already an open DataReader associated with this Command which must be closed first. Instead, the application just goes into an infinite loop with 100% CPU usage. I'm not sure if this is a C# bug or a feature.
