
MapReduce using DataFlow library

I am trying to implement a classic map-reduce problem using System.Threading.Tasks.Dataflow, and although I can get something (sort of) working, I'm struggling to see how to generalise this functionality.

Given a simple problem:

  • Produce a stream of integers; and in parallel, for each number:
    • Square the number
    • Add 5
    • Divide by 2
  • Take the sum of all numbers (a quick sequential sanity check is sketched below)
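For reference, the expected result can be computed sequentially with plain LINQ; this is just a hypothetical sanity check for the pipeline (using the same input of 10 as the test code below), not part of the dataflow setup:

// Sequential check: square, add 5, halve, then sum. Yields 217.5 for input = 10.
var expected = Enumerable.Range(1, 10)
    .Select(x => (x * x + 5) / 2.0)
    .Sum();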

The problem I'm having is that I can get this working using a BatchBlock, but I have to specify the initial size of the set of parallel tasks. This is fine for the test code (below), as I know upfront how many items I'm going to queue, but say I didn't know... how would I set this pipeline up?

Test code used (note I added a short delay into the first of the "parallel" blocks just to see some processing-time difference depending on the degree of parallelism):

using System.Diagnostics;
using System.Threading.Tasks.Dataflow;

var input = 10;

// Fan out: turn a single input value x into the stream of integers 1..x
var fanOutBlock = new TransformManyBlock<int, int>(x =>
{
    return Enumerable.Range(1, x);
});

var squareBlock = new TransformBlock<int, int>(async x =>
{
    await Task.Delay(100); // artificial delay to make the degree of parallelism visible
    return x * x;
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });

var addFiveBlock = new TransformBlock<int, int>(x =>
{
    return x + 5;
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });

var divTwoBlock = new TransformBlock<int, double>(x =>
{
    return x/2.0;
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });

// The batch size must equal the number of items produced, which has to be known upfront
var batchBlock = new BatchBlock<double>(input);

var sumBlock = new TransformBlock<IList<double>,double>(x =>
{
    return x.Sum();
});

var options = new DataflowLinkOptions { PropagateCompletion = true };

fanOutBlock.LinkTo(squareBlock, options);
squareBlock.LinkTo(addFiveBlock, options);
addFiveBlock.LinkTo(divTwoBlock, options);
divTwoBlock.LinkTo(batchBlock, options);
batchBlock.LinkTo(sumBlock, options);


var sw = Stopwatch.StartNew();
fanOutBlock.Post(input);
fanOutBlock.Complete();


var result = sumBlock.Receive();
Console.WriteLine(result);
sw.Stop();
Console.WriteLine($"{sw.ElapsedMilliseconds}ms");

await sumBlock.Completion;

One idea is to configure the BatchBlock<T> with the maximum batchSize:

var batchBlock = new BatchBlock<double>(Int32.MaxValue);

When the batchBlock is completed (when its Complete method is invoked), it will emit a single batch containing all the messages it has buffered. The disadvantage is that, by buffering every message, you might run out of memory in case the number of messages is huge. Or, if the number of messages is larger than Int32.MaxValue and miraculously you don't run out of memory, you'll get more than one batch, which, regarding the logic that you are trying to implement, will be a bug.

A better idea is to implement a custom dataflow block that aggregates the messages it receives on the fly. Something similar to the Aggregate LINQ operator:

public static TResult Aggregate<TSource, TAccumulate, TResult>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> function,
    Func<TAccumulate, TResult> resultSelector);
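As a point of reference, this is how the seed / function / resultSelector triple behaves in plain LINQ (a trivial sum, shown only to illustrate the shape of the API being mimicked):

// Aggregates 1..4 into a double accumulator; resultSelector just returns it. Yields 10.
var total = Enumerable.Range(1, 4)
    .Aggregate(0.0, (acc, x) => acc + x, acc => acc);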

Here is an implementation, composed of two native blocks that are encapsulated with the DataflowBlock.Encapsulate method:

public static IPropagatorBlock<TSource, TResult>
    CreateAggregateBlock<TSource, TAccumulate, TResult>(
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> function,
    Func<TAccumulate, TResult> resultSelector,
    ExecutionDataflowBlockOptions options = default)
{
    options ??= new ExecutionDataflowBlockOptions();
    var maxDOP = options.MaxDegreeOfParallelism;
    options.MaxDegreeOfParallelism = 1;

    var inputBlock = new ActionBlock<TSource>(item =>
    {
        seed = function(seed, item);
    }, options);

    var outputBlock = new TransformBlock<TAccumulate, TResult>(accumulate =>
    {
        return resultSelector(accumulate);
    }, options);

    options.MaxDegreeOfParallelism = maxDOP; // Restore initial value

    PropagateCompletion(inputBlock, outputBlock, () =>
    {
        outputBlock.Post(seed);
    });

    return DataflowBlock.Encapsulate(inputBlock, outputBlock);

    static void PropagateCompletion(IDataflowBlock source, IDataflowBlock target,
        Action onSuccessfulCompletion)
    {
        ThreadPool.QueueUserWorkItem(async _ =>
        {
            try { await source.Completion; } catch { }
            Exception exception =
                source.Completion.IsFaulted ? source.Completion.Exception : null;
            if (source.Completion.IsCompletedSuccessfully)
            {
                // The action is invoked before completing the target.
                try { onSuccessfulCompletion(); }
                catch (Exception ex) { exception = ex; }
            }
            if (exception != null) target.Fault(exception); else target.Complete();
        });
    }
}

A tricky part is how to propagate the completion of the one block to the other. My preferred technique is to invoke an async void method on the thread pool. This way any bug in my code will be exposed as a crashing unhandled exception. The alternative is to put the code in a fire-and-forget task continuation, in which case the effect of a bug will most likely be a silent deadlock.
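For comparison, a minimal sketch of what that fire-and-forget continuation alternative could look like inside CreateAggregateBlock, in place of the PropagateCompletion call (this is not the approach used above, and it ignores cancellation; the point is that an unobserved bug here would just leave outputBlock incomplete rather than crash):

// Hypothetical fire-and-forget continuation (sketch only).
_ = inputBlock.Completion.ContinueWith(t =>
{
    if (t.IsFaulted) ((IDataflowBlock)outputBlock).Fault(t.Exception);
    else
    {
        outputBlock.Post(seed);   // emit the accumulated value
        outputBlock.Complete();
    }
}, TaskScheduler.Default);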

Another question mark is whether the mutations of the seed state are visible to all threads involved in the calculation. I've avoided putting explicit barriers or locks, and I am relying on the implicit barriers that the TPL inserts when tasks are queued, and at the beginning and end of task execution.
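If you would rather not depend on those implicit barriers, an explicit lock around the accumulator is a straightforward alternative. A sketch of how the inputBlock inside CreateAggregateBlock could be written under that assumption:

// Hypothetical variant: serialize and publish writes to the accumulator explicitly.
var gate = new object();

var inputBlock = new ActionBlock<TSource>(item =>
{
    lock (gate) { seed = function(seed, item); }
}, options);

// ...and take the lock again when the final value is posted to outputBlock:
// lock (gate) { outputBlock.Post(seed); }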

Usage example:

var sumBlock = CreateAggregateBlock<double, double, double>(0.0,
    (acc, x) => acc + x, acc => acc);
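With this in place, the fixed-size BatchBlock and the summing TransformBlock from the question's pipeline can be replaced by the single aggregating block, so the number of items no longer needs to be known upfront. A sketch based on the question's test code:

divTwoBlock.LinkTo(sumBlock, options); // replaces divTwoBlock -> batchBlock -> sumBlock

fanOutBlock.Post(input);
fanOutBlock.Complete();

var result = sumBlock.Receive(); // 217.5 for input = 10
Console.WriteLine(result);
await sumBlock.Completion;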
