
Why is this eating memory?

I wrote an application whose purpose is to read logs from a large table (90 million rows) and process them into easily understandable stats: how many, how long, etc.

The first run took 7.5 hours and only had to process 27 million of the 90 million rows. I would like to speed this up, so I am trying to run the queries in parallel. But when I run the code below, it crashes with an Out of Memory exception within a couple of minutes.

Environments:

Sync

Test: 26 applications, 15 million logs, 5 million retrieved, < 20 MB, takes 20 seconds

Production: 56 applications, 90 million logs, 27 million retrieved, < 30 MB, takes 7.5 hours

Async

Test: 26 applications, 15 million logs, 5 million retrieved, < 20 MB, takes 3 seconds

Production: 56 applications, 90 million logs, 27 million retrieved, Out of Memory exception

public void Run()
{
    List<Application> apps;

    //Query for apps
    using (var ctx = new MyContext())
    {
        apps = ctx.Applications.Where(x => x.Type == "TypeIWant").ToList();
    }

    var tasks = new Task[apps.Count];
    for (int i = 0; i < apps.Count; i++)
    {
        var app = apps[i];
        tasks[i] = Task.Run(() => Process(app));
    }

    //try catch
    Task.WaitAll(tasks);
}

public void Process(Application app)
{
    //Query for logs for time period
    using (var ctx = new MyContext())
    {
        var logs = ctx.Logs.Where(l => l.Id == app.Id).AsNoTracking();

        foreach (var log in logs)
        {
            Interlocked.Increment(ref _totalLogsRead);

            var l = log;
            Task.Run(() => ProcessLog(l, app.Id));
        }
    }
}

Is it ill advised to create 56 contexts?

Do I need to dispose and re-create contexts after a certain number of logs have been retrieved?

Perhaps I'm misunderstanding how the IQueryable works? <-- My guess

My understanding is that it will retrieve logs as needed; I guess that means the foreach loop acts like a yield? Or is my issue that 56 'threads' call the database and I end up storing 27 million logs in memory?

Side question

The results don't really scale together. Based on the Test environment results I would expect Production to take only a few minutes. I assume the increase is directly related to the number of records in the table.

With 27 million rows, the problem is one of stream processing, not parallel execution. You need to approach it as you would with SQL Server's SSIS or any other ETL tool: each processing step is a transformation that processes its input and sends its output to the next step.

Parallel processing is achieved by using a separate thread to run each step. Some steps can also use multiple threads to process multiple inputs, up to a limit. Setting limits on each step's thread count and input buffer ensures you achieve maximum throughput without flooding your machine with waiting tasks.
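As a minimal sketch of a single bounded step (using the System.Threading.Tasks.Dataflow NuGet package; the class name and the 4/16 limits are illustrative, not from your code):

```csharp
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BoundedStep
{
    // One step with a capped thread count and a capped input buffer:
    // producers wait when the buffer is full, so work can never pile up
    // in memory faster than it is consumed.
    public static async Task<int> RunAsync(int items)
    {
        int done = 0;
        var step = new ActionBlock<int>(
            _ => Interlocked.Increment(ref done),
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = 4, // at most 4 inputs processed concurrently
                BoundedCapacity = 16        // at most 16 inputs waiting in the buffer
            });

        for (int i = 0; i < items; i++)
            await step.SendAsync(i); // awaits when full, instead of queueing unboundedly

        step.Complete();
        await step.Completion;
        return done;
    }
}
```

Contrast this with `Task.Run` per log entry, which queues every captured entry in the scheduler with no upper bound.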

.NET's TPL Dataflow addresses exactly this scenario. It provides blocks to transform inputs to outputs (TransformBlock), split collections into individual messages (TransformManyBlock), execute actions without transformation (ActionBlock), combine data in batches (BatchBlock), etc.

You can also specify the maximum degree of parallelism for each step so that, e.g., you have only 1 log query executing at a time, but use 10 tasks for log processing.

In your case, you could:

  1. Start with a TransformManyBlock that receives an application type and returns a list of app IDs
  2. A TransformBlock reads the logs for a specific ID and sends them downstream
  3. An ActionBlock processes the batch.
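The three steps above could be wired up roughly like this (a sketch using the System.Threading.Tasks.Dataflow NuGet package; `LogPipeline`, the loader delegates, and the capacity numbers are hypothetical stand-ins for your EF queries and `ProcessLog`):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class LogPipeline
{
    // Step 1: expand an application type into its app IDs.
    // Step 2: one query at a time loads the logs for an ID (the "batch").
    // Step 3: up to 10 tasks process the batches; here they only count entries.
    public static async Task<long> RunAsync(
        string appType,
        Func<string, IEnumerable<int>> loadAppIds,
        Func<int, List<string>> loadLogs)
    {
        long processed = 0;

        var getIds = new TransformManyBlock<string, int>(type => loadAppIds(type));

        var getLogs = new TransformBlock<int, List<string>>(
            id => loadLogs(id),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1, BoundedCapacity = 4 });

        var process = new ActionBlock<List<string>>(
            batch => Interlocked.Add(ref processed, batch.Count),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10, BoundedCapacity = 4 });

        // Completion propagates down the chain, so awaiting the last
        // block waits for the whole pipeline to drain.
        var link = new DataflowLinkOptions { PropagateCompletion = true };
        getIds.LinkTo(getLogs, link);
        getLogs.LinkTo(process, link);

        getIds.Post(appType);
        getIds.Complete();
        await process.Completion;
        return processed;
    }
}
```

The bounded capacities mean only a handful of log batches are ever in memory at once, no matter how many rows the table holds.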

Step #3 could be broken into many other steps. E.g., if you don't need to process all of an app's log entries together, you can use a step to process individual entries. Or you could first group them by date.

Another option is to create a custom block that reads data from the database using a DbDataReader and posts each entry to the next step immediately, instead of waiting for all rows to return. This would allow you to process each entry as it arrives, instead of waiting to receive all entries.
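Such a reader-to-block bridge could look like this (a sketch; `LogStreamer` and the two-column row shape are assumptions, and in production the `DbDataReader` would come from a SqlCommand rather than the in-memory source used to exercise it):

```csharp
using System.Data.Common;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class LogStreamer
{
    // Posts each row downstream as soon as it is read, instead of
    // materializing the whole result set in memory first.
    public static async Task<int> StreamAsync(
        DbDataReader reader, ITargetBlock<(int Id, string Message)> target)
    {
        int sent = 0;
        while (await reader.ReadAsync())
        {
            // SendAsync honours the target's BoundedCapacity, so a slow
            // consumer applies backpressure to the database read loop.
            await target.SendAsync((reader.GetInt32(0), reader.GetString(1)));
            sent++;
        }
        target.Complete(); // lets the downstream blocks finish and complete
        return sent;
    }
}
```

Any `ITargetBlock` works as the destination, so this plugs straight into the front of a Dataflow pipeline.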

If each app log contains many entries, this could be a huge memory and time saver.
