
How can I make sure a dataflow block only creates threads on an on-demand basis?

I've written a small pipeline using the TPL Dataflow API that receives data from multiple threads and processes it.

Setup 1

When I configure each block with MaxDegreeOfParallelism = Environment.ProcessorCount (8 in my case), I notice it fills up buffers on multiple threads, and processing in the second block doesn't start until roughly 1700 elements have been received across all threads. You can see this in action here.

Setup 2

When I set MaxDegreeOfParallelism = 1, all elements are received on a single thread and processing in the second block already starts after roughly 40 elements have been received. Data here.

Setup 3

When I set MaxDegreeOfParallelism = 1 and introduce a 1000 ms delay before sending each input, elements are passed on as soon as they are received, and every received element is handled on a separate thread. Data here.


So much for the setup. My questions are the following:

  1. When I compare setups 1 and 2, I notice that processing elements starts much faster in serial than in parallel (even after accounting for the fact that the parallel run has 8x as many threads). What causes this difference?

  2. Since this will run in an ASP.NET environment, I don't want to spawn unnecessary threads, because they all come from a single thread pool. As shown in setup 3, the pipeline still spreads itself over multiple threads even when there is only a handful of data. This is also surprising because, judging from setup 1, I would assume that data is spread sequentially over threads (notice how the first 50 elements all go to thread 16). Can I make sure it only creates new threads on an on-demand basis?

  3. There is another concept called BufferBlock&lt;T&gt;. If the TransformBlock&lt;T, T&gt; already queues input, what would be the practical difference of swapping the first step in my pipeline (ReceiveElement) for a BufferBlock?


using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class Program
{
    static void Main(string[] args)
    {
        var dataflowProcessor = new DataflowProcessor<string>();
        var amountOfTasks = 5;
        var tasks = new Task[amountOfTasks];

        for (var i = 0; i < amountOfTasks; i++)
        {
            tasks[i] = SpawnThread(dataflowProcessor, $"Task {i + 1}");
        }

        Task.WaitAll(tasks);
        Console.WriteLine("Finished feeding threads"); // Needs to use async main
        Console.Read();
    }

    private static Task SpawnThread(DataflowProcessor<string> dataflowProcessor, string taskName)
    {
        // Task.Run returns a hot task and unwraps the inner async operation,
        // so Task.WaitAll above actually waits for FeedData to finish.
        // (new Task(async () => ...) would only track the synchronous part
        // of the lambda up to its first await.)
        return Task.Run(() => FeedData(dataflowProcessor, taskName));
    }

    private static async Task FeedData(DataflowProcessor<string> dataflowProcessor, string threadName)
    {
        foreach (var i in Enumerable.Range(0, short.MaxValue))
        {
            await Task.Delay(1000); // Only used for the delayedSerialProcessing test
            dataflowProcessor.Process($"Thread name: {threadName}\t Thread ID:{Thread.CurrentThread.ManagedThreadId}\t Value:{i}");
        }
    }
}


public class DataflowProcessor<T>
{
    private static readonly ExecutionDataflowBlockOptions ExecutionOptions = new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = Environment.ProcessorCount
    };

    private static readonly TransformBlock<T, T> ReceiveElement = new TransformBlock<T, T>(element =>
    {
        Console.WriteLine($"Processing received element in thread {Thread.CurrentThread.ManagedThreadId}");
        return element;
    }, ExecutionOptions);

    private static readonly ActionBlock<T> SendElement = new ActionBlock<T>(element =>
    {
        Console.WriteLine($"Processing sent element in thread {Thread.CurrentThread.ManagedThreadId}");
        Console.WriteLine(element);
    }, ExecutionOptions);

    static DataflowProcessor()
    {
        ReceiveElement.LinkTo(SendElement);

        // Propagate completion and faults downstream. (The original continuation
        // faulted/completed ReceiveElement itself, which is a no-op at best;
        // the downstream SendElement is the block that needs to be notified.)
        ReceiveElement.Completion.ContinueWith(x =>
        {
            if (x.IsFaulted)
            {
                ((IDataflowBlock) SendElement).Fault(x.Exception);
            }
            else
            {
                SendElement.Complete();
            }
        });
    }


    public void Process(T newElement)
    {      
        ReceiveElement.Post(newElement);
    }
}
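To make question 3 concrete, here is roughly what the swap would look like. This is a hedged sketch, not the original code: a BufferBlock&lt;T&gt; is a pure FIFO queue that runs no per-element delegate, so the console output from the first stage disappears and only the downstream block does work.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public class BufferedProcessor<T>
{
    // Unlike TransformBlock<T, T>, a BufferBlock<T> runs no user delegate
    // and spins up no processing tasks of its own; it only stores elements
    // and offers them to linked targets.
    private readonly BufferBlock<T> _input = new BufferBlock<T>();

    private readonly ActionBlock<T> _output = new ActionBlock<T>(element =>
        Console.WriteLine(element));

    public BufferedProcessor()
    {
        // PropagateCompletion forwards Complete()/Fault() downstream,
        // replacing manual ContinueWith wiring on the Completion task.
        _input.LinkTo(_output, new DataflowLinkOptions { PropagateCompletion = true });
    }

    public void Process(T newElement) => _input.Post(newElement);

    public Task CompleteAsync()
    {
        _input.Complete();
        return _output.Completion;
    }
}
```

With a TransformBlock&lt;T, T&gt; that simply returns its input, the main observable difference is that the transform's delegate has to be scheduled on worker threads before anything reaches the second block; a BufferBlock removes that extra hop.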

Before you deploy your solution to the ASP.NET environment, I suggest you change your architecture: after a request has been handled, IIS can suspend ASP.NET threads for its own use, so your task could be left unfinished. A better approach is to create a separate Windows service daemon that handles your dataflow.

Now back to TPL Dataflow.

I love the TPL Dataflow library, but its documentation is a real mess.
The only useful document I've found is Introduction to TPL Dataflow.

It contains some clues that can be helpful, especially the ones about configuration settings (I suggest you investigate implementing your own TaskScheduler on top of your own ThreadPool implementation, and the MaxMessagesPerTask option) if you need them:

The built-in dataflow blocks are configurable, with a wealth of control provided over how and where blocks perform their work. Here are some key knobs available to the developer, all of which are exposed through the DataflowBlockOptions class and its derived types (ExecutionDataflowBlockOptions and GroupingDataflowBlockOptions), instances of which may be provided to blocks at construction time.

  • TaskScheduler customization, as @i3arnon mentioned:

    By default, dataflow blocks schedule work to TaskScheduler.Default, which targets the internal workings of the .NET ThreadPool.

  • MaxDegreeOfParallelism

    It defaults to 1, meaning only one thing may happen in a block at a time. If set to a value higher than 1, that number of messages may be processed concurrently by the block. If set to DataflowBlockOptions.Unbounded (-1), any number of messages may be processed concurrently, with the maximum automatically managed by the underlying scheduler targeted by the dataflow block. Note that MaxDegreeOfParallelism is a maximum, not a requirement.

  • MaxMessagesPerTask

    TPL Dataflow is focused on both efficiency and control. Where there are necessary trade-offs between the two, the system strives to provide a quality default but also enable the developer to customize behavior according to a particular situation. One such example is the trade-off between performance and fairness. By default, dataflow blocks try to minimize the number of task objects that are necessary to process all of their data. This provides for very efficient execution; as long as a block has data available to be processed, that block's tasks will remain to process the available data, only retiring when no more data is available (until data is available again, at which point more tasks will be spun up). However, this can lead to problems of fairness. If the system is currently saturated processing data from a given set of blocks, and then data arrives at other blocks, those latter blocks will either need to wait for the first blocks to finish processing before they're able to begin, or alternatively risk oversubscribing the system. This may or may not be the correct behavior for a given situation. To address this, the MaxMessagesPerTask option exists. It defaults to DataflowBlockOptions.Unbounded (-1), meaning that there is no maximum. However, if set to a positive number, that number will represent the maximum number of messages a given block may use a single task to process. Once that limit is reached, the block must retire the task and replace it with a replica to continue processing. These replicas are treated fairly with regards to all other tasks scheduled to the scheduler, allowing blocks to achieve a modicum of fairness between them. In the extreme, if MaxMessagesPerTask is set to 1, a single task will be used per message, achieving ultimate fairness at the potential expense of more tasks than may otherwise have been necessary.

  • MaxNumberOfGroups

    The grouping blocks are capable of tracking how many groups they've produced, and automatically complete themselves (declining further offered messages) after that number of groups has been generated. By default, the number of groups is DataflowBlockOptions.Unbounded (-1), but it may be explicitly set to a value greater than one.

  • CancellationToken

    This token is monitored during the dataflow block's lifetime. If a cancellation request arrives prior to the block's completion, the block will cease operation as politely and quickly as possible.

  • Greedy

    By default, target blocks are greedy and want all data offered to them.

  • BoundedCapacity

    This is the limit on the number of items the block may be storing and have in flight at any one time.
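Putting these knobs together: everything is set on the options object handed to the block at construction time, and a limited-concurrency TaskScheduler (here the built-in ConcurrentExclusiveSchedulerPair) is one way to keep a block from fanning out across the shared thread pool, which speaks to your question 2. This is a sketch with illustrative values, not recommendations:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class ConfiguredPipeline
{
    // Pushes `count` messages through an ActionBlock configured with the
    // knobs discussed above, and returns how many were actually processed.
    public static async Task<int> RunAsync(int count)
    {
        var cts = new CancellationTokenSource();
        var processed = 0;

        // Confine the block's work to a scheduler that runs at most one
        // task at a time, instead of the unrestricted TaskScheduler.Default.
        var pair = new ConcurrentExclusiveSchedulerPair(
            TaskScheduler.Default, maxConcurrencyLevel: 1);

        var options = new ExecutionDataflowBlockOptions
        {
            TaskScheduler = pair.ConcurrentScheduler,
            MaxDegreeOfParallelism = 1, // one message at a time (the default)
            MaxMessagesPerTask = 1,     // retire the task after each message: maximum fairness
            BoundedCapacity = 10,       // at most 10 items queued or in flight
            CancellationToken = cts.Token
        };

        var worker = new ActionBlock<int>(
            _ => Interlocked.Increment(ref processed), options);

        for (var i = 0; i < count; i++)
        {
            // SendAsync cooperates with BoundedCapacity: it asynchronously
            // waits for free space, whereas Post would return false when full.
            await worker.SendAsync(i);
        }

        worker.Complete();
        await worker.Completion;
        return processed;
    }
}
```

With BoundedCapacity set, swapping SendAsync for Post would make Post return false once the buffer fills, and messages would be lost if that return value were ignored.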

