
C# Split List<T> into groups using TPL Parallel ForEach

I need to process a List<T> of thousands of elements.

First I need to group the elements by year and type, so I obtain a List<List<T>>. Then, for each inner List<T>, I want to add objects of type T until the maximum package size is reached for that List<T>; then I create a new package and continue in the same way.

I want to use a Parallel.ForEach loop.

My current implementation works well if I run it sequentially, but the logic is not thread-safe and I want to change that.
I think the problem is in the inner Parallel.ForEach loop: when the maximum size for the List<T> is reached, I instantiate a new List<T> and assign it to the same reference.

private ConcurrentBag<ConcurrentBag<DumpDocument>> InitializePackages()
{
    // Group by Type and Year
    ConcurrentBag<ConcurrentBag<DumpDocument>> groups = new ConcurrentBag<ConcurrentBag<DumpDocument>>(Dump.DumpDocuments.GroupBy(d => new { d.Type, d.Year })
        .Select(g => new ConcurrentBag<DumpDocument> (g.ToList()))
        .ToList());

    // Documents lists with max package dimension
    ConcurrentBag<ConcurrentBag<DumpDocument>> documentGroups = new ConcurrentBag<ConcurrentBag<DumpDocument>>();

    foreach (ConcurrentBag<DumpDocument> group in groups)
    {       
        long currentPackageSize = 0;

        ConcurrentBag<DumpDocument> documentGroup = new ConcurrentBag<DumpDocument>();

        ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = Parameters.MaxDegreeOfParallelism };
        Parallel.ForEach(group, options, new Action<DumpDocument, ParallelLoopState>((DumpDocument document, ParallelLoopState state) =>
            {
                long currentDocumentSize = new FileInfo(document.FilePath).Length;

                // If MaxPackageSize = 0 then no splitting to apply and the process works well
                if (Parameters.MaxPackageSize > 0 && currentPackageSize + currentDocumentSize > Parameters.MaxPackageSize)
                {
                    documentGroups.Add(documentGroup);

                    // Here's the problem!
                    documentGroup = new ConcurrentBag<DumpDocument>();

                    currentPackageSize = 0;
                }

                documentGroup.Add(document);
                currentPackageSize += currentDocumentSize;
            }));

        if (documentGroup.Count > 0)
            documentGroups.Add(documentGroup);
    }

    return documentGroups;
}

public class DumpDocument
{
    public string Id { get; set; }
    public long Type { get; set; }
    public string MimeType { get; set; }
    public int Year { get; set; }
    public string FilePath { get; set; }
}

Since my operation is quite simple, I actually only need to get the file size using:

long currentDocumentSize = new FileInfo(document.FilePath).Length;

I have read that I could also use a Partitioner, but I've never used one and in any case it's not my priority at the moment.

I have also already read this similar question, but it doesn't solve my problem with the inner loop.

UPDATE 28/12/2016

I updated the code to meet the verification requirements.

After the code update it seems that you are using ConcurrentBag, but there is still other non-thread-safe logic left in your code:

long currentPackageSize = 0;
if (// .. && 
    currentPackageSize + currentDocumentSize > Parameters.MaxPackageSize
// ...
{
    // ...
    currentPackageSize += currentDocumentSize;
}

The += operator isn't atomic, so you'll definitely have a race condition there, and reading the value of a long variable isn't thread-safe here either. You can either introduce locks or use the Interlocked class to update the value atomically:

Interlocked.Add(ref currentPackageSize, currentDocumentSize);
Interlocked.Exchange(ref currentPackageSize, 0);
Interlocked.Read(ref currentPackageSize);

Using this class will require some refactoring of your code (I think that using CAS operations such as CompareExchange would be preferable in your case), so the easiest way for you may be to use locks. You should probably implement both approaches, test them, and measure the execution time.
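
For illustration, a minimal sketch of such a CAS loop over the shared counter could look like the following (the TryAddOrReset helper is hypothetical, not part of your code, and swapping documentGroup itself would still need separate synchronization):

// Requires: using System.Threading;
// Hypothetical helper: atomically adds documentSize to packageSize, or resets the
// counter to documentSize when the limit would be exceeded.
// Returns true when the current package should be closed and a new one started.
private static bool TryAddOrReset(ref long packageSize, long documentSize, long maxPackageSize)
{
    while (true)
    {
        long observed = Interlocked.Read(ref packageSize);
        long desired = observed + documentSize;

        if (maxPackageSize > 0 && desired > maxPackageSize)
        {
            // Restart the counter from this document's size only.
            if (Interlocked.CompareExchange(ref packageSize, documentSize, observed) == observed)
                return true;
        }
        else if (Interlocked.CompareExchange(ref packageSize, desired, observed) == observed)
        {
            return false;
        }
        // Another thread changed the counter in the meantime; retry.
    }
}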

Also, as you can see, the instantiation of the new documentGroup isn't thread-safe either, so you would have to either lock that variable (which introduces thread-synchronization pauses) or refactor your code into two steps: first get all the file sizes in parallel, then iterate over the results sequentially, avoiding race conditions.
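
A rough sketch of that two-step approach, reusing the DumpDocument and Parameters types from your code (the BuildPackages method name is only illustrative), could look like this:

// Requires: System.Collections.Concurrent, System.Collections.Generic,
// System.IO, System.Threading.Tasks
private List<List<DumpDocument>> BuildPackages(IEnumerable<DumpDocument> group)
{
    // Step 1: read the file sizes in parallel; only the thread-safe dictionary is shared.
    var sizes = new ConcurrentDictionary<DumpDocument, long>();
    Parallel.ForEach(group, document =>
        sizes[document] = new FileInfo(document.FilePath).Length);

    // Step 2: build the packages sequentially, so no shared state is mutated concurrently.
    var packages = new List<List<DumpDocument>>();
    var currentPackage = new List<DumpDocument>();
    long currentPackageSize = 0;

    foreach (var pair in sizes)
    {
        if (Parameters.MaxPackageSize > 0 && currentPackage.Count > 0 &&
            currentPackageSize + pair.Value > Parameters.MaxPackageSize)
        {
            packages.Add(currentPackage);
            currentPackage = new List<DumpDocument>();
            currentPackageSize = 0;
        }

        currentPackage.Add(pair.Key);
        currentPackageSize += pair.Value;
    }

    if (currentPackage.Count > 0)
        packages.Add(currentPackage);

    return packages;
}

You would then call this method once per year/type group in InitializePackages, and no ConcurrentBag would be needed for the results at all.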

As for the Partitioner, you shouldn't use that class here, as it is usually used to schedule work across CPUs, not to split results.

However, I'd like to note that you have some minor code issues:

  1. You can remove the ToList() calls inside the ConcurrentBag constructors, because the constructor accepts an IEnumerable, which you already have:

     ConcurrentBag<ConcurrentBag<DumpDocument>> groups = new ConcurrentBag<ConcurrentBag<DumpDocument>>(
         Dump.DumpDocuments.GroupBy(d => new { d.Type, d.Year })
             .Select(g => new ConcurrentBag<DumpDocument>(g)));

    This will help you avoid unnecessary copies of your grouped data.

  2. You can use the var keyword to avoid repeating the types in your code (this is just a sample line; the same change can be applied in many places across your code):

     foreach (var group in groups) 
  3. You should not set the maximum degree of parallelism unless you know what you're doing (and I think that you don't):

     var options = new ParallelOptions { MaxDegreeOfParallelism = Parameters.MaxDegreeOfParallelism }; 

    The TPL's default task scheduler tries to adjust the thread pool and CPU usage for your tasks, so in general this number should be equal to Environment.ProcessorCount (see the short example after this list).

  4. You can use lambda syntax with Parallel.ForEach instead of creating a new Action (you could also move this code out into a separate method):

     (document, state) =>
     {
         long currentDocumentSize = new FileInfo(document.FilePath).Length;

         // If MaxPackageSize = 0 then no splitting to apply and the process works well
         if (Parameters.MaxPackageSize > 0 && currentPackageSize + currentDocumentSize > Parameters.MaxPackageSize)
         {
             documentGroups.Add(documentGroup);

             // Here's the problem!
             documentGroup = new ConcurrentBag<DumpDocument>();

             currentPackageSize = 0;
         }

         documentGroup.Add(document);
         currentPackageSize += currentDocumentSize;
     }

    The lambda compiles correctly because you already have a generic collection (a bag), and there is an overload of Parallel.ForEach that accepts a ParallelLoopState as the second parameter.
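
If you do keep an explicit limit (see point 3 above), a typical choice is simply the processor count, for example:

var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };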
