
IEnumerable<T>, Parallel.ForEach and Memory Management

I am reading and processing very large amounts of SQL Server data (tens of millions of rows in, hundreds of millions of rows out). The processing performed on each source row is significant. A single-threaded version does not perform to expectations. My current parallel processing version performs very well on some smaller batches (300,000 source rows, 1M output rows), but I am running into Out Of Memory exceptions on very large runs.

The code was significantly inspired by the answers provided here: Is there a way to use the Task Parallel Library (TPL) with SQLDataReader?

Here is the general idea:

Get the source data (the data is too large to read into memory, so we will "stream" it):

public static IEnumerable<MyObject> ReadData()
{
    using (SqlConnection con = new SqlConnection(Settings.ConnectionString))
    using (SqlCommand cmd = new SqlCommand(selectionSql, con))
    {
        con.Open();
        using (SqlDataReader dr = cmd.ExecuteReader(CommandBehavior.CloseConnection))
        {
            while (dr.Read())
            {
                // make some decisions here - 1 to n source rows are used
                // to create an instance of MyObject
                yield return new MyObject(/* some parameters */);
            }
        }
    }
}

Once we get to the point of parallel processing, we want to use the SqlBulkCopy object to write the data. Because of this, we don't want to process individual MyObjects in parallel, as we want to perform one bulk copy per thread. Instead, we'll read from the above with another IEnumerable that returns a "batch" of MyObjects:

class MyObjectBatch 
{
    public List<MyObject> Items { get; set; }

    public MyObjectBatch (List<MyObject> items)
    {
        this.Items = items;
    }

    public static IEnumerable<MyObjectBatch> Read(int batchSize)
    {
        List<MyObject> items = new List<MyObject>();
        foreach (MyObject o in DataAccessLayer.ReadData())
        {
            items.Add(o);
            if (items.Count >= batchSize)
            {
                yield return new MyObjectBatch(items);                    
                items = new List<MyObject>(); // reset
            }
        }
        if (items.Count > 0) yield return new MyObjectBatch(items);            
    }
}

Finally, we get to the point of processing the "batches" in parallel:

ObjectProcessor processor = new ObjectProcessor();

ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = Settings.MaxThreads };
Parallel.ForEach(MyObjectBatch.Read(Settings.BatchSize), options, batch =>
{
    // Create a container for data processed by this thread
    // the container implements IDataReader
    ProcessedData targetData = new ProcessedData(/* some params */);

    // process the batch… for each MyObject in MyObjectBatch – 
    // results are collected in targetData
    for (int index = 0; index < batch.Items.Count; index++) 
    {
        processor.Process(batch.Items[index], targetData);
    }

    // bulk copy the data – this creates a SqlBulkCopy instance
    // and loads the data to the target table
    DataAccessLayer.BulkCopyData(targetData);

    // explicitly set the batch and targetData to null to try to free resources
    batch = null;
    targetData = null;

});

Everything above has been significantly simplified, but I believe it includes all of the important concepts. Here is the behavior I am seeing:

Performance is very good - for reasonably sized data sets, I am getting very good results.

However, as it processes, the memory consumed continues to grow. For larger data sets, this leads to exceptions.

I have proved through logging that if I slow down the reads from the database, it slows down the batch reads and, subsequently, the creation of the parallel threads (especially if I set the MaxDegreeOfParallelism). I was concerned that I was reading faster than I could process, but if I limit the threads, it should only read what each thread can handle.

Smaller or larger batch sizes have some effect on performance, but the amount of memory used grows consistently with the size of the batch.

Where is there an opportunity to recover some memory here? As my "batches" go out of scope, should that memory be recovered? Is there anything I could be doing at the first two layers that would help free some resources?

To answer some questions: 1. Could it be done purely in SQL - no, the processing logic is very complex (and dynamic). Generally speaking, it is doing low-level binary decoding. 2. We have tried SSIS (with some success). The issue is that the definition of the source data, as well as the output, is very dynamic. SSIS seems to require very strict input and output column definitions, which won't work in this case.

Someone also asked about the ProcessedData object - this is actually fairly simple:

class ProcessedData : IDataReader 
{
    private int _currentIndex = -1;
    private string[] _fieldNames { get; set; }

    public string TechnicalTableName { get; set; }        
    public List<object[]> Values { get; set; }

    public ProcessedData(string schemaName, string tableName, string[] fieldNames)
    {            
        this.TechnicalTableName = "[" + schemaName + "].[" + tableName + "]";
        _fieldNames = fieldNames;            
        this.Values = new List<object[]>();
    }

    #region IDataReader Implementation

    public int FieldCount
    {
        get { return _fieldNames.Length; }
    }

    public string GetName(int i)
    {
        return _fieldNames[i];
    }

    public int GetOrdinal(string name)
    {
        int index = -1;
        for (int i = 0; i < _fieldNames.Length; i++)
        {
            if (_fieldNames[i] == name)
            {
                index = i;
                break;
            }
        }
        return index;
    }

    public object GetValue(int i)
    {
        if (i > (Values[_currentIndex].Length - 1))
        {
            return null;
        }
        else
        {
            return Values[_currentIndex][i];
        }
    }

    public bool Read()
    {
        if ((_currentIndex + 1) < Values.Count)
        {
            _currentIndex++;
            return true;
        }
        else
        {
            return false;
        }
    }

    // Other IDataReader things not used by SqlBulkCopy not implemented
}
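
BulkCopyData itself is not shown in the question; a minimal sketch of what such a method might look like, assuming the same Settings.ConnectionString and that the ProcessedData columns line up with the target table by position (this is my reconstruction, not the original implementation):

public static void BulkCopyData(ProcessedData data)
{
    using (SqlConnection con = new SqlConnection(Settings.ConnectionString))
    {
        con.Open();
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(con))
        {
            bulkCopy.DestinationTableName = data.TechnicalTableName;
            bulkCopy.BulkCopyTimeout = 0;   // no timeout for large loads (assumption)
            bulkCopy.BatchSize = 10000;     // rows per round trip (assumption)
            bulkCopy.WriteToServer(data);   // ProcessedData acts as the IDataReader source
        }
    }
}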

UPDATE and CONCLUSION:

I received a great deal of valuable input, but want to summarize it all into a single conclusion. First, my main question was whether there was anything else I could do (with the code I posted) to aggressively reclaim memory. The consensus seems to be that the approach is correct, but that my particular problem is not entirely CPU-bound, so a simple Parallel.ForEach will not manage the processing correctly.

Thanks to usr for his debugging suggestion and his very interesting PLINQ suggestion. Thanks to zmbq for helping clarify what was and wasn't happening.

Finally, anyone else who may be chasing a similar issue will likely find the following discussions helpful:

Parallel.ForEach can cause a "Out Of Memory" exception if working with a enumerable with a large object

Parallel Operation Batching

I do not fully understand how Parallel.ForEach is pulling items, but I think by default it pulls more than one to save locking overhead. This means that multiple items might be queued internally inside Parallel.ForEach. This might cause OOM quickly because your items are very big individually.

You could try giving it a Partitioner that returns single items.
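
The "Partitioner that returns single items" in the linked post is a custom partitioner; one built-in way to get similar behavior (my assumption, requires .NET 4.5) is the enumerable partitioner with buffering disabled, so the batches from the question are handed to the workers one at a time:

using System.Collections.Concurrent;

var noBuffering = Partitioner.Create(
    MyObjectBatch.Read(Settings.BatchSize),
    EnumerablePartitionerOptions.NoBuffering); // hand out one batch at a time, no internal queueing

Parallel.ForEach(noBuffering, options, batch =>
{
    // same processing body as in the question
});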

If that does not help, we need to dig deeper. Debugging memory issues with Parallel and PLINQ is nasty. There was a bug in one of those, for example, that caused old items not to be released quickly.

As a workaround, you could clear the list after processing. That will at least allow all items to be reclaimed deterministically after processing has been done.
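
Applied to the loop body from the question, that would amount to something like this (a sketch; clearing targetData.Values as well is my addition, not part of the answer):

    // at the end of the Parallel.ForEach body, after BulkCopyData:
    batch.Items.Clear();       // let the MyObject instances become collectable
    targetData.Values.Clear(); // assumption: also drop the output rows explicitly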

Regarding the code you posted: it is clean, of high quality, and you are adhering to high standards of resource management. I would not suspect a gross memory or resource leak on your part. It is still not impossible. You can test this by commenting out the code inside of the Parallel.ForEach and replacing it with a Thread.Sleep(1000 * 60). If the leak persists, you are not at fault.
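
For that test, the loop body would be reduced to something like:

Parallel.ForEach(MyObjectBatch.Read(Settings.BatchSize), options, batch =>
{
    // processing and bulk copy commented out for the leak test
    Thread.Sleep(1000 * 60); // hold each batch for a minute and watch memory usage
});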

In my experience, PLINQ makes it easier to get an exact degree of parallelism (because the current version uses the exact DOP you specify, never less, never more). Like this:

GetRows()
.AsBatches(10000)    
.AsParallel().WithDegreeOfParallelism(8)
.Select(TransformItems) //generate rows to write
.AsEnumerable() //leave PLINQ
.SelectMany(x => x) //flatten batches
.AsBatches(1000000) //create new batches with different size
.AsParallel().WithDegreeOfParallelism(2) //PLINQ with different DOP
.ForEach(WriteBatchToDB); //write to DB
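
AsBatches and ForEach are not built-in LINQ operators; the pipeline assumes small extension methods along these lines (a sketch, names taken from the snippet above):

using System;
using System.Collections.Generic;

public static class EnumerableExtensions
{
    // Group a sequence into lists of up to batchSize items.
    public static IEnumerable<List<T>> AsBatches<T>(this IEnumerable<T> source, int batchSize)
    {
        var batch = new List<T>(batchSize);
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count >= batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }
        if (batch.Count > 0) yield return batch;
    }

    // Run an action for every element, forcing enumeration.
    public static void ForEach<T>(this IEnumerable<T> source, Action<T> action)
    {
        foreach (T item in source) action(item);
    }
}

(PLINQ's built-in ForAll(WriteBatchToDB) would execute the action on the PLINQ worker threads; a plain ForEach extension like this one enumerates on the calling thread.)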

This would give you a simple pipeline that pulls from the DB, does the CPU-bound work with a specific DOP optimized for the CPU, and writes to the database with much bigger batches and less DOP.

This is quite simple, and it should max out the CPUs and the disks independently with their respective DOPs. Play with the DOP numbers.

You're keeping two things in memory - your input data and your output data. You've tried to read and process that data in parallel, but you're not reducing the overall memory footprint - you still end up keeping most of the data in memory, and the more threads you have, the more data you keep in memory.

I guess most of the memory is taken up by your output data, as you create 10 times more output records than input records. So you have a few (10? 30? 50?) SqlBulkCopy operations.

That is actually too much. You can gain a lot of speed by writing 100,000 records in bulk. What you should do is split your work - read 10,000-20,000 records, create the output records, SqlBulkCopy them to the database, and repeat. Your memory consumption will drop considerably.
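
In terms of the code from the question, that advice translates roughly into a chunked read/process/write loop; a sketch, reusing the names from above and an assumed chunk size of 10,000:

foreach (MyObjectBatch batch in MyObjectBatch.Read(10000))
{
    ProcessedData targetData = new ProcessedData(/* some params */);

    foreach (MyObject item in batch.Items)
    {
        processor.Process(item, targetData);
    }

    // write this small chunk immediately, then let it go out of scope
    DataAccessLayer.BulkCopyData(targetData);
}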

You can, of course, do that in parallel - handle several 10,000-record batches in parallel.

Just keep in mind that Parallel.ForEach and the thread pool in general are meant to optimize CPU usage. Chances are that what limits you is I/O on the database server. While databases can handle concurrency quite well, their limit doesn't depend on the number of cores on your client machine, so you'd better play with the number of concurrent threads and see what's fastest.
