
How do I split and merge this dataflow pipeline?

I am trying to create a dataflow using TPL Dataflow with the following form:

                    -> LoadDataBlock1 -> ProcessDataBlock1 ->  
GetInputPathsBlock  -> LoadDataBlock2 -> ProcessDataBlock2 -> MergeDataBlock -> SaveDataBlock
                    -> LoadDataBlock3 -> ProcessDataBlock3 ->
                    ...                             
                    -> LoadDataBlockN -> ProcessDataBlockN ->

The idea is that GetInputPathsBlock is a block which finds the paths to the input data that is to be loaded, and then sends each path to a LoadDataBlock. The LoadDataBlocks are all identical (except that they have each received a unique inputPath string from GetInputPathsBlock). The loaded data is then sent to the ProcessDataBlock, which does some simple processing. The data from each ProcessDataBlock is then sent to MergeDataBlock, which merges it and sends it on to SaveDataBlock, which saves it to a file.

Think of it as a dataflow that needs to run for each month. First the path is found for each day's data. Each day's data is loaded and processed, then merged together for the entire month and saved. Each month can run in parallel, the data for each day in a month can be loaded and processed in parallel (once that day's data has been loaded), and once everything for the month has been loaded and processed, it can be merged and saved.

What I tried

As far as I can tell, TransformManyBlock<TInput, string> can be used to do the splitting (GetInputPathsBlock), and can be linked to a normal TransformBlock<string, InputData> (LoadDataBlock), and from there to another TransformBlock<InputData, ProcessedData> (ProcessDataBlock), but I don't know how to then merge it back into a single block.
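The split half, as far as I've gotten it, looks roughly like this (a simplified sketch, with strings standing in for my real InputData/ProcessedData types):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks.Dataflow;

class SplitSketch
{
    static void Main()
    {
        // GetInputPathsBlock: one month in, N file paths out
        var getInputPathsBlock = new TransformManyBlock<int, string>(month =>
            Enumerable.Range(1, DateTime.DaysInMonth(2020, month))
                .Select(day => $"Files/2020{month:00}{day:00}.txt"));

        // LoadDataBlock and ProcessDataBlock (strings stand in for
        // InputData and ProcessedData here)
        var loadDataBlock = new TransformBlock<string, string>(path => "raw:" + path);
        var processDataBlock = new TransformBlock<string, string>(raw => "processed:" + raw);

        var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
        getInputPathsBlock.LinkTo(loadDataBlock, linkOptions);
        loadDataBlock.LinkTo(processDataBlock, linkOptions);

        // ...this is where I'm stuck: how do I merge the N processed
        // outputs per month back into a single MergeDataBlock?
    }
}
```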

What I looked at

I found this answer, which uses TransformManyBlock to go from an IEnumerable<item> to item, but I don't fully understand it, and I can't link a TransformBlock<InputData, ProcessedData> (ProcessDataBlock) to a TransformBlock<IEnumerable<ProcessedData>, ProcessedData>, so I don't know how to use it.

I have also seen answers like this one, which suggest using JoinBlock, but the number of input files N varies, and the files are all loaded in the same way anyway.

There is also this answer, which seems to do what I want, but I don't fully understand it, and I don't know how the setup with the dictionary would transfer to my case.

How do I split and merge my dataflow?

  • Is there a block type I am missing?
  • Can I somehow use TransformManyBlock twice?
  • Does TPL Dataflow make sense for the split/merge, or is there a simpler async/await way?

I would use a nested block to avoid splitting my monthly data and then having to merge it again. Here is an example of two nested TransformBlocks that process all days of the year 2020:

var monthlyBlock = new TransformBlock<int, List<string>>(async (month) =>
{
    var dailyBlock = new TransformBlock<int, string>(async (day) =>
    {
        await Task.Delay(100); // Simulate async work
        return day.ToString();
    }, new ExecutionDataflowBlockOptions() { MaxDegreeOfParallelism = 4 });

    foreach (var day in Enumerable.Range(1, DateTime.DaysInMonth(2020, month)))
        await dailyBlock.SendAsync(day);
    dailyBlock.Complete();

    var dailyResults = await dailyBlock.ToListAsync();
    return dailyResults;
}, new ExecutionDataflowBlockOptions() { MaxDegreeOfParallelism = 1 });

foreach (var month in Enumerable.Range(1, 12))
    await monthlyBlock.SendAsync(month);
monthlyBlock.Complete();

For collecting the daily results of the inner block I used the extension method ToListAsync shown below:

public static async Task<List<T>> ToListAsync<T>(this IReceivableSourceBlock<T> block,
    CancellationToken cancellationToken = default)
{
    var list = new List<T>();
    while (await block.OutputAvailableAsync(cancellationToken).ConfigureAwait(false))
    {
        while (block.TryReceive(out var item))
        {
            list.Add(item);
        }
    }
    await block.Completion.ConfigureAwait(false); // Propagate possible exception
    return list;
}

The answers to your questions are: no, you don't need another block type; yes, you can use TransformManyBlock twice; and yes, it does make sense. I wrote some code to prove it, which is at the bottom, with some notes on how it works after that.

The code uses a split-then-merge pipeline as you describe. As for the bit you were struggling with: merging the data for individual files back together can be done by adding processed items to a list as they become available. We then only pass the list on to the next block once it holds the expected final number of items. This can be done with a fairly simple TransformMany block returning zero or one items. This block can't be parallelized because the list isn't threadsafe.

Once you've got a pipeline like this, you can test the parallelization and ordering just by changing the options passed to the blocks. The code below sets parallelization to unbounded for every block it can, and lets the Dataflow code sort it out. On my machine it maxes out all the cores/logical processors and is CPU-bound, which is what we want. Ordering is enabled, but turning it off doesn't make much difference: again, we are CPU-bound.

Finally, I have to say this is a very cool technology, but you can actually solve this problem much more simply using PLINQ, where just a few lines of code get you something just as fast. The big drawback is that you can't easily add fast-arriving messages to a pipeline incrementally if you do that: PLINQ is better suited to one big batch process. But PLINQ may still be a better solution for your use case.
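For comparison, here is a minimal sketch of what that PLINQ version could look like. The LoadData/ProcessData stand-ins are dummies in the same spirit as the code below, not a benchmarked implementation:

```csharp
using System;
using System.Linq;

class PlinqSketch
{
    // Dummy stand-ins for the real load and process steps
    static string LoadData(string path) => "This is content from file " + path;
    static string ProcessData(string contents) => $"Results of processing '{contents}'";

    static void Main()
    {
        // Months run in parallel; the days within each month also run in
        // parallel, with AsOrdered keeping the merged output in day order.
        var results = Enumerable.Range(1, 12)
            .AsParallel()
            .Select(month => (month, merged: string.Join("\n",
                Enumerable.Range(1, DateTime.DaysInMonth(2020, month))
                    .AsParallel().AsOrdered()
                    .Select(day => ProcessData(LoadData($"Files/2020{month:00}{day:00}.txt"))))))
            .ToList();

        Console.WriteLine($"Merged {results.Count} months"); // 12
    }
}
```

The whole split/load/process/merge structure collapses into one nested query, at the cost of losing the ability to stream new months into a long-lived pipeline.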

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks.Dataflow;

namespace ParallelDataFlow
{
    class Program
    {
        static void Main(string[] args)
        {
            new Program().Run();
            Console.ReadLine();
        }

        private void Run()
        {
            Stopwatch s = new Stopwatch();
            s.Start();

            // You can experiment with the parallelization of blocks by changing MaxDegreeOfParallelism
            var options = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded };
            var getInputPathsBlock = new TransformManyBlock<(int, int), WorkItem>(date => GetWorkItemWithInputPath(date), options);
            var loadDataBlock = new TransformBlock<WorkItem, WorkItem>(workItem => LoadDataIntoWorkItem(workItem), options);
            var processDataBlock = new TransformBlock<WorkItem, WorkItem>(workItem => ProcessDataForWorkItem(workItem), options);
            var waitForProcessedDataBlock = new TransformManyBlock<WorkItem, List<WorkItem>>(workItem => WaitForWorkItems(workItem));  // Can't parallelize this block
            var mergeDataBlock = new TransformBlock<List<WorkItem>, List<WorkItem>>(list => MergeWorkItemData(list), options);
            var saveDataBlock = new ActionBlock<List<WorkItem>>(list => SaveWorkItemData(list), options);

            var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
            getInputPathsBlock.LinkTo(loadDataBlock, linkOptions);
            loadDataBlock.LinkTo(processDataBlock, linkOptions);
            processDataBlock.LinkTo(waitForProcessedDataBlock, linkOptions);
            waitForProcessedDataBlock.LinkTo(mergeDataBlock, linkOptions);
            mergeDataBlock.LinkTo(saveDataBlock, linkOptions);

            // We post individual tuples of (year, month) to our pipeline, as many as we want
            getInputPathsBlock.Post((1903, 2));  // Post one month and date
            var dates = from y in Enumerable.Range(2015, 5) from m in Enumerable.Range(1, 12) select (y, m);
            foreach (var date in dates) getInputPathsBlock.Post(date);  // Post a big sequence         

            getInputPathsBlock.Complete();
            saveDataBlock.Completion.Wait();
            s.Stop();
            Console.WriteLine($"Completed in {s.ElapsedMilliseconds}ms on {ThreadAndTime()}");
        }

        private IEnumerable<WorkItem> GetWorkItemWithInputPath((int year, int month) date)
        {
            List<WorkItem> processedWorkItems = new List<WorkItem>();  // Will store merged results
            return GetInputPaths(date.year, date.month).Select(
                path => new WorkItem
                {
                    Year = date.year,
                    Month = date.month,
                    FilePath = path,
                    ProcessedWorkItems = processedWorkItems
                });
        }

        // Get filepaths of form e.g. Files/20191101.txt  These aren't real files, they just show how it could work.
        private IEnumerable<string> GetInputPaths(int year, int month) =>
            Enumerable.Range(0, GetNumberOfFiles(year, month)).Select(i => $@"Files/{year}{Pad(month)}{Pad(i + 1)}.txt");

        private int GetNumberOfFiles(int year, int month) => DateTime.DaysInMonth(year, month);

        private WorkItem LoadDataIntoWorkItem(WorkItem workItem) {
            workItem.RawData = LoadData(workItem.FilePath);
            return workItem;
        }

        // Simulate loading by just concatenating to path: in real code this could open a real file and return the contents
        private string LoadData(string path) => "This is content from file " + path;

        private WorkItem ProcessDataForWorkItem(WorkItem workItem)
        {
            workItem.ProcessedData = ProcessData(workItem.RawData);
            return workItem;
        }

        private string ProcessData(string contents)
        {
            Thread.SpinWait(11000000); // Use 11,000,000 for ~50ms on Windows .NET Framework.  1,100,000 on Windows .NET Core.
            return $"Results of processing file with contents '{contents}' on {ThreadAndTime()}";
        }

        // Adds a processed WorkItem to its ProcessedWorkItems list.  Then checks if the list has as many processed WorkItems as we 
        // expect to see overall.  If so the list is returned to the next block, if not we return an empty array, which passes nothing on.
        // This isn't threadsafe for the list, so has to be called with MaxDegreeOfParallelization = 1
        private IEnumerable<List<WorkItem>> WaitForWorkItems(WorkItem workItem)
        {
            List<WorkItem> itemList = workItem.ProcessedWorkItems;
            itemList.Add(workItem);
            return itemList.Count == GetNumberOfFiles(workItem.Year, workItem.Month) ? new[] { itemList } : new List<WorkItem>[0];
        }

        private List<WorkItem> MergeWorkItemData(List<WorkItem> processedWorkItems)
        {
            string finalContents = "";
            foreach (WorkItem workItem in processedWorkItems)
            {
                finalContents = MergeData(finalContents, workItem.ProcessedData);
            }
            // Should really create a new data structure and return that, but let's cheat a bit
            processedWorkItems[0].MergedData = finalContents;
            return processedWorkItems;
        }

        // Just concatenate the output strings, separated by newlines, to merge our data
        private string MergeData(string output1, string output2) => output1 != "" ? output1 + "\n" + output2 : output2;

        private void SaveWorkItemData(List<WorkItem> workItems)
        {
            WorkItem result = workItems[0];
            SaveData(result.MergedData, result.Year, result.Month);
            // Code to show it's worked...
            Console.WriteLine($"Saved data block for {DateToString((result.Year, result.Month))} on {ThreadAndTime()}." +
                              $"  File contents:\n{result.MergedData}\n");
        }
        private void SaveData(string finalContents, int year, int month)
        {
            // Actually save, although don't really need to in this test code
            new DirectoryInfo("Results").Create();
            File.WriteAllText(Path.Combine("Results", $"results{year}{Pad(month)}.txt"), finalContents);
        }

        // Helper methods
        private string DateToString((int year, int month) date) => date.year + Pad(date.month);
        private string Pad(int number) => number < 10 ? "0" + number : number.ToString();
        private string ThreadAndTime() => $"thread {Pad(Thread.CurrentThread.ManagedThreadId)} at {DateTime.Now.ToString("hh:mm:ss.fff")}";
    }

    public class WorkItem
    {
        public int Year { get; set; }
        public int Month { get; set; }
        public string FilePath { get; set; }
        public string RawData { get; set; }
        public string ProcessedData { get; set; }
        public List<WorkItem> ProcessedWorkItems { get; set; }
        public string MergedData { get; set; }
    }
}

This code passes a WorkItem object from each block to the next and enriches it at each stage. It then creates a final list with all the WorkItems for a month in it, before running an aggregation process on that and saving the results.

The code is based on dummy methods for each stage, using the names you used. These don't do much, but hopefully they demonstrate the solution. For example, LoadData is handed a file path and just adds some text to it and passes the string on, but obviously it could load a real file and pass the contents on if there actually were a file on disk.

Similarly, to simulate doing work in ProcessData we do a Thread.SpinWait and then again just add some text to the string. This is where the delay comes from, so change the number if you want it to run faster or slower. The code was written on the .NET Framework, but it runs on Core 3.0, and on Ubuntu and OSX. The only difference is that a SpinWait cycle can be significantly longer or shorter, so you may want to play with the delay.

Note that we could have merged in the waitForProcessedDataBlock and had exactly the pipeline you were asking for. It just would have been a bit more confusing.

The code does create files on disk at the end, but it also dumps the results to the screen, so it doesn't really need to.

If you set parallelization to 1 you'll find it slows down by about the amount you'd expect. My Windows machine is four-core, and it's slightly worse than four times slower.
