
How to restore the order of a shuffled Dataflow pipeline?

I have a Dataflow pipeline that consists of multiple blocks processing heterogeneous documents (XLS, PDF, etc.). Each type of document is processed by a dedicated TransformBlock. At the end of the pipeline there is an ActionBlock that receives all the processed documents and uploads them one by one to a web server. My problem is that I can't find a way to satisfy the requirement of uploading the documents in the same order they were initially entered in the pipeline. For example, I can't use the EnsureOrdered option to my advantage, because this option configures the behavior of a single block, not the behavior of multiple blocks working in parallel. My requirements are:

  1. Insert the documents in the pipeline in a specific order.
  2. Process each document differently, depending on its type.
  3. Documents of a specific type should be processed sequentially.
  4. Documents of different types can (and should) be processed in parallel.
  5. All documents should be uploaded as soon as possible after they are processed.
  6. The documents must be uploaded sequentially, in the same order they were entered in the pipeline.

For example, document #8 must be uploaded after document #7, even if its processing finishes before document #7's.

The fifth requirement means that I can't wait for all documents to be processed, then sort them by index, and finally upload them. The uploading must happen concurrently with the processing.

Here is a minimal example of what I'm trying to do. For simplicity I am not feeding the blocks with instances of the IDocument interface, but with simple integers. The value of each integer represents the order in which it was entered in the pipeline, and the order in which it must be uploaded:

var xlsBlock = new TransformBlock<int, int>(document =>
{
    int duration = 300 + document % 3 * 300;
    Thread.Sleep(duration); // Simulate CPU-bound work
    return document;
});
var pdfBlock = new TransformBlock<int, int>(document =>
{
    int duration = 100 + document % 5 * 200;
    Thread.Sleep(duration); // Simulate CPU-bound work
    return document;
});

var uploader = new ActionBlock<int>(async document =>
{
    Console.WriteLine($"Uploading document #{document}");
    await Task.Delay(500); // Simulate I/O-bound work
});

xlsBlock.LinkTo(uploader);
pdfBlock.LinkTo(uploader);

foreach (var document in Enumerable.Range(1, 10))
{
    if (document % 2 == 0)
        xlsBlock.Post(document);
    else
        pdfBlock.Post(document);
}
xlsBlock.Complete();
pdfBlock.Complete();
_ = Task.WhenAll(xlsBlock.Completion, pdfBlock.Completion)
    .ContinueWith(_ => uploader.Complete());

await uploader.Completion;

The output is:

Uploading document #1
Uploading document #2
Uploading document #3
Uploading document #5
Uploading document #4
Uploading document #7
Uploading document #6
Uploading document #9
Uploading document #8
Uploading document #10

(Try it on Fiddle)

The desired order is #1, #2, #3, #4, #5, #6, #7, #8, #9, #10.

How can I restore the order of the processed documents before sending them to the uploader block?

Clarification: Drastically changing the schema of the pipeline, by replacing the multiple specific TransformBlocks with a single generic TransformBlock, is not an option. The ideal scenario would be to intercept a single block between the processors and the uploader that restores the order of the documents.

The uploader should add each document to some sorted list of completed documents, and check whether the added document is the one that should be uploaded next. If it is, it should remove and upload documents from the sorted list until one is missing.

There is also a synchronization problem. Access to this sorted list must be synchronized across threads. But you want all threads to be doing something instead of waiting for other threads to complete their work. So the uploader should work with the list like this:

  • Within a sync lock, add the new document to the list, and release the lock.
  • In a loop:
    • enter the same sync lock again,
    • if the upload_in_progress flag is set then do nothing and return,
    • check if the document on top of the list should be uploaded,
      • if not then reset the upload_in_progress flag, and return,
      • otherwise remove the document from the list,
      • set the upload_in_progress flag,
      • release the lock,
      • upload the document.

I hope I imagined it right. As you can see, it's tricky to make it both safe and efficient. There's surely a way to do it with only one lock in most cases, but it wouldn't add much efficiency. The upload_in_progress flag is shared between tasks, like the list itself.
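The steps above can be sketched in C# like this. The `OrderedUploader` class and its members are illustrative names invented for this sketch, not part of any existing API, and the flag handling is one possible interpretation of the loop described above:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Minimal sketch of the lock + upload_in_progress algorithm described above.
public class OrderedUploader<T>
{
    private readonly object _locker = new object();
    private readonly SortedList<long, T> _completed = new SortedList<long, T>();
    private readonly Func<T, Task> _uploadAsync;
    private long _nextIndex;
    private bool _uploadInProgress;

    public OrderedUploader(Func<T, Task> uploadAsync, long startingIndex = 0)
    {
        _uploadAsync = uploadAsync;
        _nextIndex = startingIndex;
    }

    // Called when a processed document arrives, possibly out of order.
    public async Task AddAsync(long index, T document)
    {
        // Within the lock, add the new document to the list, then release.
        lock (_locker) _completed.Add(index, document);
        while (true)
        {
            T next;
            lock (_locker)
            {
                // Another task is already draining the list; nothing to do.
                if (_uploadInProgress) return;
                // The document on top of the list is not the next one yet.
                if (_completed.Count == 0 || _completed.Keys[0] != _nextIndex)
                    return;
                next = _completed.Values[0];
                _completed.RemoveAt(0);
                _nextIndex++;
                _uploadInProgress = true;
            }
            try { await _uploadAsync(next); } // upload outside the lock
            finally { lock (_locker) _uploadInProgress = false; }
        }
    }
}
```

Each processing block would call `AddAsync` with the document's original index; out-of-order arrivals stay buffered in the sorted list until the missing index shows up.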

I managed to implement a dataflow block that can restore the order of my shuffled pipeline, based on Dialecticus's idea of a sorted list that contains the processed documents. Instead of a SortedList I ended up using a simple Dictionary, which seems to work just as well.

/// <summary>Creates a dataflow block that restores the order of
/// a shuffled pipeline.</summary>
public static IPropagatorBlock<T, T> CreateRestoreOrderBlock<T>(
    Func<T, long> indexSelector,
    long startingIndex = 0L,
    DataflowBlockOptions options = null)
{
    if (indexSelector == null) throw new ArgumentNullException(nameof(indexSelector));
    var executionOptions = new ExecutionDataflowBlockOptions();
    if (options != null)
    {
        executionOptions.CancellationToken = options.CancellationToken;
        executionOptions.BoundedCapacity = options.BoundedCapacity;
        executionOptions.EnsureOrdered = options.EnsureOrdered;
        executionOptions.TaskScheduler = options.TaskScheduler;
        executionOptions.MaxMessagesPerTask = options.MaxMessagesPerTask;
        executionOptions.NameFormat = options.NameFormat;
    }

    var buffer = new Dictionary<long, T>();
    long minIndex = startingIndex;

    IEnumerable<T> Transform(T item)
    {
        // No synchronization needed because MaxDegreeOfParallelism = 1
        long index = indexSelector(item);
        if (index < startingIndex)
            throw new InvalidOperationException($"Index {index} is out of range.");
        if (index < minIndex)
            throw new InvalidOperationException($"Index {index} has been consumed.");
        if (!buffer.TryAdd(index, item)) // .NET Core only API
            throw new InvalidOperationException($"Index {index} is not unique.");
        while (buffer.Remove(minIndex, out var minItem)) // .NET Core only API
        {
            minIndex++;
            yield return minItem;
        }
    }

    // Ideally the assertion buffer.Count == 0 should be checked on the completion
    // of the block.
    return new TransformManyBlock<T, T>(Transform, executionOptions);
}

Usage example:

var xlsBlock = new TransformBlock<int, int>(document =>
{
    int duration = 300 + document % 3 * 300;
    Thread.Sleep(duration); // Simulate CPU-bound work
    return document;
});
var pdfBlock = new TransformBlock<int, int>(document =>
{
    int duration = 100 + document % 5 * 200;
    Thread.Sleep(duration); // Simulate CPU-bound work
    return document;
});

var orderRestorer = CreateRestoreOrderBlock<int>(
    indexSelector: document => document, startingIndex: 1L);

var uploader = new ActionBlock<int>(async document =>
{
    Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} Uploading document #{document}");
    await Task.Delay(500); // Simulate I/O-bound work
});

xlsBlock.LinkTo(orderRestorer);
pdfBlock.LinkTo(orderRestorer);
orderRestorer.LinkTo(uploader, new DataflowLinkOptions { PropagateCompletion = true });

foreach (var document in Enumerable.Range(1, 10))
{
    if (document % 2 == 0)
        xlsBlock.Post(document);
    else
        pdfBlock.Post(document);
}
xlsBlock.Complete();
pdfBlock.Complete();
_ = Task.WhenAll(xlsBlock.Completion, pdfBlock.Completion)
    .ContinueWith(_ => orderRestorer.Complete());

await uploader.Completion;

Output:

09:24:18.846 Uploading document #1
09:24:19.436 Uploading document #2
09:24:19.936 Uploading document #3
09:24:20.441 Uploading document #4
09:24:20.942 Uploading document #5
09:24:21.442 Uploading document #6
09:24:21.941 Uploading document #7
09:24:22.441 Uploading document #8
09:24:22.942 Uploading document #9
09:24:23.442 Uploading document #10

(Try it on Fiddle, featuring a .NET Framework compatible version)
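For .NET Framework, the two `Dictionary` APIs flagged as ".NET Core only" in the code above (`TryAdd`, and `Remove` with an `out` parameter) could be approximated with extension methods along these lines. This is a sketch of one possible substitution, not the exact code from the linked Fiddle:

```csharp
using System.Collections.Generic;

// Stand-ins for the .NET Core-only Dictionary APIs used in
// CreateRestoreOrderBlock. On .NET Framework these extension methods
// fill the gap; on .NET Core the built-in instance methods take precedence.
public static class DictionaryExtensions
{
    public static bool TryAdd<TKey, TValue>(
        this Dictionary<TKey, TValue> source, TKey key, TValue value)
    {
        if (source.ContainsKey(key)) return false;
        source.Add(key, value);
        return true;
    }

    public static bool Remove<TKey, TValue>(
        this Dictionary<TKey, TValue> source, TKey key, out TValue value)
    {
        if (!source.TryGetValue(key, out value)) return false;
        source.Remove(key);
        return true;
    }
}
```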
