
Per-thread instance object in TPL Parallel.ForEach

Is there a TPL syntax which allows you to inject objects from a pool into tasks so that one object is only used by one thread at a time? Or even better, only ever used by the same thread?

Usage example

Assume I want to create 10 threads which open 10 files (1.txt, 2.txt, 3.txt ... 10.txt) and write 500 000 consecutive numbers to these files, distributed randomly among them.

I can do this:

ConcurrentQueue<int> objs = new ConcurrentQueue<int>(); // 500000 numbers go here
Task[] tasks = Enumerable.Range(1, 10)
    .Select(i =>
    {
        return Task.Factory.StartNew(() => 
        {
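            // File.Open needs a FileMode; FileMode.Create is assumed here
            using (var f = File.Open($"{i}.txt", FileMode.Create))
            {
                using (var wr = new StreamWriter(f))
                {
                    // each task drains the shared queue into its own file
                    while (objs.TryDequeue(out int obj))
                    {
                        wr.WriteLine(obj);
                    }
                }
            }
        });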
    })
    .ToArray();
Task.WaitAll(tasks);

However, is it possible to get the same behaviour without using concurrent collections, just with TPL?

It would be better if everything except the last two edits was removed.

If the question is "Can you pass an object per task (not thread) when using Parallel?", the answer is: yes, you can, through any of the overloads that accept local state, i.e. those with a TLocal type parameter, like this one:

public static ParallelLoopResult ForEach<TSource, TLocal>(
    IEnumerable<TSource> source,
    Func<TLocal> localInit,
    Func<TSource, ParallelLoopState, TLocal, TLocal> body,
    Action<TLocal> localFinally
)

Parallel.For doesn't use threads directly. It partitions the data and creates one task for each partition. Each task ends up processing all of a partition's data. Typically, Parallel uses as many tasks as there are cores. It also uses the current thread for processing, which is why it appears to block the current thread. It doesn't; the current thread is being used to process one of the partitions.
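
One way to observe this is to record which task and which thread handled each item. The following is only a minimal sketch: the dictionaries and variable names are made up for illustration, and the number of tasks and threads it reports depends on the machine and the scheduler, so no particular counts should be expected.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical bookkeeping: how many items each task / thread processed.
var taskIds = new ConcurrentDictionary<int, int>();
var threadIds = new ConcurrentDictionary<int, int>();
int callerThreadId = Environment.CurrentManagedThreadId;

Parallel.ForEach(Enumerable.Range(0, 10_000), item =>
{
    // The body is expected to run inside a task; -1 is a fallback if it does not.
    int taskId = Task.CurrentId ?? -1;
    taskIds.AddOrUpdate(taskId, 1, (_, n) => n + 1);
    threadIds.AddOrUpdate(Environment.CurrentManagedThreadId, 1, (_, n) => n + 1);
});

int totalItems = taskIds.Values.Sum();
Console.WriteLine($"Items: {totalItems}, tasks: {taskIds.Count}, threads: {threadIds.Count}");
Console.WriteLine($"Calling thread took part: {threadIds.ContainsKey(callerThreadId)}");
```

Every item is counted exactly once, while the task and thread counts are usually far smaller than the item count.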

The functions that deal with local data allow you to generate an initial local value and pass it to each body invocation. All overloads with local data require the body to return the (possibly modified) data, so Parallel itself doesn't have to store it. This is essential, since Parallel can terminate and restart tasks. It wouldn't be able to do so easily or efficiently if it had to keep track of local data.

For this particular example, and setting aside the fact that ORMs are unsuitable for bulk operations, especially when dealing with hundreds of thousands of objects, localInit should create a new session, body should use and return that session, and localFinally should dispose of it.
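
To make the flow of local data concrete, here is a small sketch that uses a running subtotal as the local state instead of a session. The names total, initCalls and subtotal are illustrative only; the number of localInit calls varies from run to run.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

long total = 0;
int initCalls = 0;

Parallel.ForEach(
    Enumerable.Range(1, 1000),
    // localInit: runs once per task and creates that task's private state.
    () => { Interlocked.Increment(ref initCalls); return 0L; },
    // body: receives the task's state and returns the (possibly modified) value.
    (item, loopState, subtotal) => subtotal + item,
    // localFinally: runs once per task; merging into shared state needs synchronization.
    subtotal => Interlocked.Add(ref total, subtotal)
);

// The sum of 1..1000 is 500500 regardless of how the work was partitioned.
Console.WriteLine($"Total: {total}, localInit calls: {initCalls}");
```

The body never touches shared state, so it needs no locking. Only localFinally synchronizes, once per task.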

var mySessionFactory=....;
var myData=....;
Parallel.ForEach(
    myData,
    ()=>CreateSession(),
    (record,state,session)=>{
        //process the data etc.
        return session;
    },
    (session)=>session.Dispose()
);

Some more warnings though. NHibernate (NH) keeps changes in memory until they are flushed and the cache is cleared out. This will create memory issues. One solution would be to keep count and flush the data periodically. Instead of a session, the state could be an (int counter,Session session) tuple:

Parallel.ForEach(
    myData,
    ()=>(counter:0,session:CreateSession()),
    (record,state,localData)=>{
        var (counter,session)=localData;
        //process the data etc.
        counter++;
        //flush and clear after every 1000 records, not on the very first one
        if (counter % 1000 ==0)
        {
            session.Flush();
            session.Clear();
        }
        return (counter,session);
    },
    data=>data.session.Dispose()
);

A better solution would be to batch the objects in advance, so that instead of an IEnumerable<MyRecord> the loop would work on IEnumerable<MyRecord[]> arrays. In conjunction with batched statements this would reduce the performance penalty imposed by ORMs on bulk operations.

Writing a Batch method isn't hard, but MoreLinq already provides one, available as source or a NuGet package:

var myBatches=myData.Batch(1000);
Parallel.ForEach(
    myBatches,
    ()=>CreateSession(),
    (records,state,session)=>{

        foreach(var record in records)
        {
            //process the data etc.
            session.Save(record);                
        }
        session.Flush();
        session.Clear();
        return session;
    },
    session=>session.Dispose()
);
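
For readers who would rather not add a dependency, a hand-written batching helper might look roughly like the sketch below. This Batch function is a hypothetical stand-in, not MoreLinq's implementation, and it returns lists rather than arrays. On .NET 6 and later, the built-in Enumerable.Chunk covers the same need.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: splits a sequence into lists of at most `size` items,
// preserving order. Because it is an iterator, the argument check runs lazily,
// when enumeration starts.
static IEnumerable<IReadOnlyList<T>> Batch<T>(IEnumerable<T> source, int size)
{
    if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
    var batch = new List<T>(size);
    foreach (var item in source)
    {
        batch.Add(item);
        if (batch.Count == size)
        {
            yield return batch;
            batch = new List<T>(size);
        }
    }
    // Emit the final, possibly shorter, batch.
    if (batch.Count > 0) yield return batch;
}

var batches = Batch(Enumerable.Range(1, 10), 4).ToList();
foreach (var b in batches)
    Console.WriteLine(string.Join(",", b));
// prints:
// 1,2,3,4
// 5,6,7,8
// 9,10
```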

No, there is not.

The closest solution is to create N threads manually (either with Task or Parallel.For / Parallel.ForEach) and use a ConcurrentQueue to distribute the data in a thread-safe way.
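
A rough sketch of that approach with Parallel.For follows. It assumes in-memory StringWriter instances in place of the ten files so that it runs without touching the disk; WorkerCount, writers and written are illustrative names.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

const int WorkerCount = 10;
var numbers = new ConcurrentQueue<int>(Enumerable.Range(1, 500_000));

// One writer per worker; StringWriter stands in for a StreamWriter over "{i}.txt".
var writers = Enumerable.Range(0, WorkerCount).Select(_ => new StringWriter()).ToArray();
var written = new int[WorkerCount];

Parallel.For(0, WorkerCount, worker =>
{
    // Each index runs exactly once, so writers[worker] is never shared;
    // only the queue is accessed concurrently.
    while (numbers.TryDequeue(out int n))
    {
        writers[worker].WriteLine(n);
        written[worker]++;
    }
});

Console.WriteLine($"Written in total: {written.Sum()}");
foreach (var w in writers) w.Dispose();
```

How the numbers are split between workers is not deterministic, and some workers may end up writing nothing if the others drain the queue first.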
