
Defining BoundedCapacity degrades performance

Is there any way to limit performance degradation with TPL Dataflow throttling?

I have a complicated pipeline of components and am trying to limit the memory it requires. I read from multiple files in parallel; some components in the pipeline may do additional reads from random parts of those files, and the rest perform CPU-bound operations.

I simplified the performance test bench into the following tests, which share a common test method.

private void TPLPerformaceTest(int generateNumbers,
    ExecutionDataflowBlockOptions transformBlockOptions)
{
    var transformBlock = new TransformBlock<int, int>(i => i, transformBlockOptions);

    var storedCount = 0;
    var generatedCount = 0;
    var store = new ActionBlock<int>(i => Interlocked.Increment(ref storedCount));

    transformBlock.LinkTo(store);
    transformBlock.Completion.ContinueWith(_ => store.Complete());

    for (int i = 0; i < generateNumbers; i++)
    {
        transformBlock.SendAsync(i).Wait(); //To ensure delivery
        Interlocked.Increment(ref generatedCount);
    }
    transformBlock.Complete();
    store.Completion.Wait();

    Assert.IsTrue(generatedCount == generateNumbers);
    Assert.IsTrue(storedCount == generateNumbers);
}

The first test has no throttling. On my CPU it takes about 12 s to complete, consumes about 800 MB of RAM, and averages about 35% CPU utilization.

[Test]
public void TPLPerformaceUnlimitedTest()
{
    var generateNumbers = 100_000_000;
    this.TPLPerformaceTest(generateNumbers,new ExecutionDataflowBlockOptions());
}

The second test just sets BoundedCapacity to int.MaxValue, which should impose no limit at all. It takes 20-30 s to complete, consumes 2.1 GB of RAM, and averages about 50% CPU utilization. According to manual, BoundedCapacity should be set to int.MaxValue by default, so I do not see the reason for the performance degradation.

[Test]
[Sequential]
public void TPLPerformaceBounedCapacityTest()
{
    var generateNumbers = 100_000_000;
    this.TPLPerformaceTest(generateNumbers,new ExecutionDataflowBlockOptions()
      { BoundedCapacity = Int32.MaxValue });
}

The third test limits BoundedCapacity to generateNumbers / 1000, i.e. 100,000. It takes 60 s to complete, consumes 450 MB of RAM, and averages about 60% CPU utilization.

[Test]
[Sequential]
public void TPLPerformaceBounedCapacityTenthTest()
{
    var generateNumbers = 100_000_000;
    this.TPLPerformaceTest(generateNumbers,new ExecutionDataflowBlockOptions()
      { BoundedCapacity = generateNumbers / 1000 });
}

The fourth test sets MaxDegreeOfParallelism to -1, which according to the manual means no limit. It consumed 27 GB of RAM, averaged about 85% CPU utilization, and had not finished after 5 minutes.

[Test]
[Sequential]
public void TPLPerformaceMaxDegreeOfParallelismTest()
{
    var generateNumbers = 100_000_000;
    this.TPLPerformaceTest(generateNumbers, new ExecutionDataflowBlockOptions()
      { MaxDegreeOfParallelism = -1 });
}

All of these settings seem to hurt performance badly and do not match my reasonable expectations.

You're degrading the performance because of this line:

    transformBlock.SendAsync(i).Wait(); //To ensure delivery

This blocks the current thread until the delivery is done. You should switch to await, which frees the thread so that other tasks can proceed:

    await transformBlock.SendAsync(i); //To ensure delivery
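
For await to compile, the enclosing method has to be async. Here is a minimal sketch of how the question's test method might look in that form. The names AsyncProducerSketch and RunAsync are made up for illustration, and using PropagateCompletion in place of the manual ContinueWith is a substitution, not part of the original code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class AsyncProducerSketch
{
    // Hypothetical async variant of the question's test method.
    // Returns the number of items that reached the final block.
    public static async Task<int> RunAsync(int generateNumbers,
        ExecutionDataflowBlockOptions transformBlockOptions)
    {
        var transformBlock = new TransformBlock<int, int>(i => i, transformBlockOptions);

        var storedCount = 0;
        var store = new ActionBlock<int>(i => Interlocked.Increment(ref storedCount));

        // PropagateCompletion completes 'store' once transformBlock has finished.
        transformBlock.LinkTo(store, new DataflowLinkOptions { PropagateCompletion = true });

        for (int i = 0; i < generateNumbers; i++)
        {
            // Awaiting yields to the caller instead of blocking a thread
            // while a bounded block has no free capacity.
            await transformBlock.SendAsync(i);
        }

        transformBlock.Complete();
        await store.Completion;
        return storedCount;
    }
}
```

A test could then await AsyncProducerSketch.RunAsync(generateNumbers, options) and compare the result with generateNumbers, assuming the test framework supports async tests that return Task.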
 
 

Update:

I'm confused by your statement that

According to manual, BoundedCapacity should be set to int.MaxValue by default

because this is not true. From the official documentation:

BoundedCapacity
The majority of the dataflow blocks included in System.Threading.Tasks.Dataflow.dll support the specification of a bounded capacity.
This is the limit on the number of items the block may be storing and have in flight at any one time.
By default, this value is initialized to DataflowBlockOptions.Unbounded ( -1 ), meaning that there is no limit.

Here you can see all the default values after running this code:

 var options = new ExecutionDataflowBlockOptions();

[Screenshot: default property values of the ExecutionDataflowBlockOptions instance]
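
The same defaults can be read in code instead of a debugger. This is a small sketch; the class name DefaultOptionsSketch and the method Describe are illustrative only.

```csharp
using System;
using System.Threading.Tasks.Dataflow;

public static class DefaultOptionsSketch
{
    // Formats the defaults discussed in this answer as a single line.
    public static string Describe()
    {
        var options = new ExecutionDataflowBlockOptions();
        return $"BoundedCapacity={options.BoundedCapacity}, " +
               $"MaxDegreeOfParallelism={options.MaxDegreeOfParallelism}, " +
               $"MaxMessagesPerTask={options.MaxMessagesPerTask}";
    }
}
```

Calling Console.WriteLine(DefaultOptionsSketch.Describe()) should print BoundedCapacity=-1, MaxDegreeOfParallelism=1, MaxMessagesPerTask=-1, where -1 is the DataflowBlockOptions.Unbounded constant.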

So the second test, with BoundedCapacity set to int.MaxValue, does add a limitation: the block now has to check whether there is free space in its buffers.

You can see similar behavior in the third test. It consumes much less memory than the second, but performs more checks and spends more time waiting for the buffers to free up space, so it runs slower while allocating far less memory.
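
To make that waiting visible, here is a small sketch of how a bounded block pushes back on its producer. The capacity of 1, the gate used to hold the first item in flight, and the names BackPressureSketch and RunAsync are assumptions made only for the demonstration.

```csharp
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BackPressureSketch
{
    public static async Task<(bool First, bool Second, bool PendingBefore, bool Third)> RunAsync()
    {
        // The gate keeps the first item "in flight" until we release it.
        var gate = new TaskCompletionSource<bool>();
        var block = new ActionBlock<int>(
            _ => gate.Task,
            new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });

        bool first = block.Post(1);    // accepted: the block was empty
        bool second = block.Post(2);   // declined: the single slot is taken

        // SendAsync does not give up; its task stays incomplete until space frees up.
        Task<bool> pending = block.SendAsync(3);
        bool pendingBefore = pending.IsCompleted;

        gate.SetResult(true);          // let the first item finish
        bool third = await pending;    // the postponed item is now accepted

        block.Complete();
        await block.Completion;
        return (first, second, pendingBefore, third);
    }
}
```

With an unbounded block, every Post would be accepted immediately and none of this bookkeeping would be needed, which is consistent with the slowdown described above.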

Also, you can see on the screenshot that MaxDegreeOfParallelism is equal to 1:

MaxDegreeOfParallelism
By default, an individual dataflow block processes only one message at a time, queueing all not-yet-processed messages so that they may be processed when the currently processing message is completed.

After setting this parameter to -1, you open Pandora's box, because all the messages are executed at the same time by the same task scheduler. According to the documentation again:

If set to DataflowBlockOptions.Unbounded ( -1 ), any number of messages may be processed concurrently, with the maximum automatically managed by the underlying scheduler targeted by the dataflow block.

Judging by the memory consumption, the task scheduler decides to start a new thread for each message because none is available in the thread pool. Each thread takes about 1 MB, so you end up with about 27,000 threads fighting each other for CPU time. And as you can see, they aren't good at that.
The recommended degree of parallelism is usually Environment.ProcessorCount, so if you want to speed up a single block, you can set MaxDegreeOfParallelism to that value. However, in more complicated scenarios this is not always the best option, because other blocks will stall while waiting for CPU time.
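
As a rough sketch, options that cap both parallelism and buffer size could be built like this. The helper name CreateOptions and the idea of passing the capacity as a parameter are illustrative; suitable values depend on the pipeline.

```csharp
using System;
using System.Threading.Tasks.Dataflow;

public static class BoundedParallelSketch
{
    // Hypothetical helper: one worker per logical processor, plus an explicit buffer limit.
    public static ExecutionDataflowBlockOptions CreateOptions(int boundedCapacity)
    {
        return new ExecutionDataflowBlockOptions
        {
            // Avoids the unbounded (-1) setting that let the scheduler flood the system.
            MaxDegreeOfParallelism = Environment.ProcessorCount,
            // Caps how many items the block may hold, queued or in flight.
            BoundedCapacity = boundedCapacity
        };
    }
}
```

For the question's test method this might be used as this.TPLPerformaceTest(generateNumbers, BoundedParallelSketch.CreateOptions(100_000)), with the capacity tuned by measurement.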

So what are your reasonable expectations for this?
