
Parallel tasks performance in C#

I need to make my tasks run faster. I tried a semaphore, the parallel library, and threads (I tried opening one thread per work item, which I know is the dumbest thing to do), but none of them gives the performance I need. I'm not familiar with threading, and I need some help to find the right approach and to understand how Tasks and Threads work.

Here is the function:

    public class Test
    {
        public void openThreads()
        {
            int maxConcurrency = 500;
            var someWork = get_data_from_database();
            using (SemaphoreSlim concurrencySemaphore = new SemaphoreSlim(maxConcurrency))
            {
                List<Task> tasks = new List<Task>();
                foreach (var work in someWork)
                {
                    concurrencySemaphore.Wait();

                    var t = Task.Factory.StartNew(() =>
                    {
                        try
                        {
                            ScrapThings(work);
                        }
                        finally
                        {
                            concurrencySemaphore.Release();
                        }
                    });

                    tasks.Add(t);
                }

                Task.WaitAll(tasks.ToArray());
            }
        }

        public async Task ScrapThings(Object work)
        {
            HttpClient client = new HttpClient();
            Encoding utf8 = Encoding.UTF8;
            var response = client.GetAsync(work.url).Result;
            var buffer = response.Content.ReadAsByteArrayAsync().Result;
            string content = utf8.GetString(buffer);
            /*
             Do some parse operations, load html document, get xpath, split things, etc 
             */

            while(true) // this loop runs from 1~15 times
            {
                response = client.GetAsync(work.anotherUrl).Result;
                buffer = response.Content.ReadAsByteArrayAsync().Result;
                content = utf8.GetString(buffer);
                if (content == "OK")
                    break;

                await Task.Delay(10000); //I need some throttle here before it tries again
            }
            /*
                Do some parse operations, load html document, get xpath, split things, etc 
                */
            update_things_in_database();
        }
    }

I want to run this task 500 times in parallel. The whole operation takes 18 hours to complete and I need to reduce that. I'm using a Xeon with 32 cores/64 threads. I tried opening 500 threads (better performance compared to the semaphore and the parallel library), but it doesn't feel like the right way to do it.

I would say the performance problem is not in how you run your threads, but in how the individual threads perform. Depending on the version of .NET and the libraries you are using, there are a few possible issues.

  1. You should reuse HttpClient instances; creating a new one for every request can exhaust the available sockets under load.
  2. If work.url and work.anotherUrl use the same subset of domains, you should look into the connection limit per endpoint (and the total limit as well). Depending on the version, that is either HttpClientHandler.MaxConnectionsPerServer, or ServicePoint.ConnectionLimit and ServicePointManager.DefaultConnectionLimit. The former is for .NET Core and the latter are for the full .NET Framework.

The recommended approach to solve the first issue is to use IHttpClientFactory.

And some more info.

UPD

You mentioned in the comments that you are using .NET 4.7.2, so I would suggest starting by adding the following lines to your application (somewhere at startup):

ServicePointManager.DefaultConnectionLimit = 500;
// if you can get a collection of the most scraped domains:
var domains = new [] { "http://slowwly.robertomurray.co.uk" };
foreach(var d in domains)
{
    var delayServicePoint = ServicePointManager.FindServicePoint(new Uri(d));
    delayServicePoint.ConnectionLimit = 10; // or bigger
}
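
Putting the points above together, here is a minimal sketch of how the loop in the question could be restructured around one shared HttpClient and fully asynchronous throttling. It is an illustration under assumptions, not a drop-in fix: the names Scraper, RunThrottledAsync and FetchAsync are hypothetical, and the parsing and database steps are left out. One thing it works around: in the question, Task.Factory.StartNew calls the async ScrapThings without awaiting the Task it returns, so the semaphore is likely released at the first await rather than when the work item is really finished.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public static class Scraper
{
    // A single HttpClient shared by every request in the process.
    public static readonly HttpClient Client = new HttpClient();

    // Runs 'work' for every item with at most 'maxConcurrency' operations in flight.
    public static async Task RunThrottledAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task> work)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = new List<Task>();
            foreach (var item in items)
            {
                // Asynchronous wait: no thread is blocked while the gate is full.
                await gate.WaitAsync();
                tasks.Add(RunOneAsync(item, work, gate));
            }
            // Completes only when every work item has really finished.
            await Task.WhenAll(tasks);
        }
    }

    private static async Task RunOneAsync<T>(T item, Func<T, Task> work, SemaphoreSlim gate)
    {
        try
        {
            await work(item);
        }
        finally
        {
            // Released after the awaited work ends, not at its first await.
            gate.Release();
        }
    }

    // Hypothetical helper: downloads a page and decodes it as UTF-8,
    // using await instead of .Result so no thread sits idle during I/O.
    public static async Task<string> FetchAsync(string url)
    {
        using (var response = await Client.GetAsync(url))
        {
            var buffer = await response.Content.ReadAsByteArrayAsync();
            return Encoding.UTF8.GetString(buffer);
        }
    }
}
```

A call could look like `await Scraper.RunThrottledAsync(someWork, 500, work => ScrapThings(work));`, assuming ScrapThings is itself rewritten to await its requests. The connection limits set above still apply, since they cap how many of those 500 operations can actually talk to one host at the same time.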

This sounds like a job for the TPL Dataflow library. You probably need different concurrency levels for the I/O-bound operations (web requests, database updates) and the CPU-bound operations (parsing the data). TPL Dataflow lets you build a pipeline where each block is responsible for a single operation and the data flows from one block to the next. It even allows cyclic graphs, so you can, for example, throw a failed data element back into a block so that it is processed again.

For some examples of using this library, look here, here or here.

The TPL Dataflow library is built into .NET Core, and is available as a package for .NET Framework.
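
As a rough sketch of the pipeline idea described above, the following links a download block, a parse block and a save block, each with its own degree of parallelism. The class name ScrapePipeline, its parameters and the three delegates are assumptions made for illustration; the real download, parsing and database code from the question would be passed in by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class ScrapePipeline
{
    public static async Task RunAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> download, // I/O-bound: web request
        Func<string, string> parse,          // CPU-bound: HTML parsing
        Func<string, Task> save,             // I/O-bound: database update
        int ioParallelism,
        int cpuParallelism)
    {
        var ioOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = ioParallelism };
        var cpuOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = cpuParallelism };

        var downloadBlock = new TransformBlock<string, string>(download, ioOptions);
        var parseBlock = new TransformBlock<string, string>(parse, cpuOptions);
        var saveBlock = new ActionBlock<string>(save, ioOptions);

        // Completion (and faults) propagate down the chain, so awaiting
        // the last block is enough to know the whole pipeline is done.
        var link = new DataflowLinkOptions { PropagateCompletion = true };
        downloadBlock.LinkTo(parseBlock, link);
        parseBlock.LinkTo(saveBlock, link);

        foreach (var url in urls)
        {
            await downloadBlock.SendAsync(url);
        }

        downloadBlock.Complete();
        await saveBlock.Completion;
    }
}
```

With this shape, a value such as 500 for ioParallelism and something close to the core count for cpuParallelism would be a plausible starting point to measure from, not a recommendation. The retry loop from the question could be modeled as an extra block that posts unfinished items back to itself, which is the cyclic-graph case mentioned above.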
