
TPL Parallel.ForEach with IO and compute intensive tasks

I have billions of XML log files in Azure Blob storage that need to be processed and queried, with the results stored. I am using Parallel.ForEach because the files can be processed independently of each other.

Parallel.ForEach<String> (listOfFeatureFiles, file => { 
  //For each file that was created
  string fileName = file;
  string directoryPath = outputfolderPath + "/" + FeatureFolderName;
  string finalFilePath = directoryPath + "/" + fileName;

  DownloadContent();
  XMLParseAndQueryData();
  UploadResultToQueue();
  DeleteLocalCopy();
});

If the work were purely compute-intensive, I would expect maximum CPU usage. In my scenario, however, 20% of the files are much bigger (gigabytes) than the remaining 80%. This usually results in only about 50% CPU usage on 4 cores. How can I optimize this to reach maximum CPU usage, i.e. > 90%?

My assumption is that while a task is downloading a big file it uses no CPU, and that no new thread is created in the meantime that could make use of the idle processing power. I might be wrong about this assumption and would appreciate a concrete link that refutes it.

I built a similar application for one of my customers that also processes lots of XML files of varying sizes. Downloading will interfere with CPU usage; you can't avoid that. But you can improve CPU usage by using a BlockingCollection with multiple consumers, so that smaller files keep being processed while a larger file is being downloaded.
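As a rough illustration of the BlockingCollection idea, here is a minimal sketch in which a few download tasks feed a bounded collection and several consumer tasks parse whatever is ready. The `Download` and `Parse` helpers, the "big" file-name prefix, and the capacity of 16 are hypothetical placeholders, not part of the original application.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ProducerConsumerSketch
{
    // Hypothetical stand-in for the real download: "big" files simply wait longer.
    static string Download(string name)
    {
        Thread.Sleep(name.StartsWith("big", StringComparison.Ordinal) ? 200 : 10);
        return name;
    }

    // Hypothetical stand-in for the CPU-bound XML parsing and querying.
    static string Parse(string localFile) => localFile.ToUpperInvariant();

    public static string[] Run(string[] files, int downloaders, int parsers)
    {
        // Bounded (assumed capacity) so downloads cannot run far ahead of parsing.
        using var downloaded = new BlockingCollection<string>(boundedCapacity: 16);
        var pending = new ConcurrentQueue<string>(files);
        var results = new ConcurrentBag<string>();

        // Producers: each takes the next pending file name and downloads it.
        var producers = Enumerable.Range(0, downloaders)
            .Select(_ => Task.Factory.StartNew(() =>
            {
                while (pending.TryDequeue(out var name))
                    downloaded.Add(Download(name));
            }, TaskCreationOptions.LongRunning))
            .ToArray();

        // Consumers: parse files as soon as any download finishes.
        var consumers = Enumerable.Range(0, parsers)
            .Select(_ => Task.Factory.StartNew(() =>
            {
                foreach (var localFile in downloaded.GetConsumingEnumerable())
                    results.Add(Parse(localFile));
            }, TaskCreationOptions.LongRunning))
            .ToArray();

        Task.WaitAll(producers);
        downloaded.CompleteAdding(); // lets the consumers' loops end
        Task.WaitAll(consumers);
        return results.ToArray();
    }
}
```

With more than one downloader, a slow download of a large file occupies only one producer while the others keep supplying small files to the consumers. The numbers of downloaders and parsers would need tuning against the real workload.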

"My assumption is that while a task is downloading a big file it uses no CPU, and that no new thread is created in the meantime that could make use of the idle processing power."

Are you sure you have enough network bandwidth and that downloading the files is not actually the bottleneck of this process?

If you are, and the slow adding of threads is actually what is slowing you down, then the quick and dirty solution would be to force the ThreadPool (which Parallel.ForEach() uses internally) to have more threads. You can do that by calling ThreadPool.SetMinThreads.
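A minimal sketch of that quick fix might look like the following. The helper name `RaiseMinWorkerThreads` and the target of 32 in the usage note are assumptions for illustration; the right value depends on how many downloads block at once.

```csharp
using System;
using System.Threading;

public static class ThreadPoolSketch
{
    // Raises the minimum number of worker threads to at least desiredWorkers,
    // leaving the IO completion thread minimum unchanged.
    // Returns false if the pool rejected the change.
    public static bool RaiseMinWorkerThreads(int desiredWorkers)
    {
        ThreadPool.GetMinThreads(out int workers, out int ioThreads);
        ThreadPool.GetMaxThreads(out int maxWorkers, out _);

        // Never lower the current minimum and never exceed the pool maximum.
        int target = Math.Min(Math.Max(workers, desiredWorkers), maxWorkers);
        return ThreadPool.SetMinThreads(target, ioThreads);
    }
}
```

Calling `ThreadPoolSketch.RaiseMinWorkerThreads(32)` once at startup, before Parallel.ForEach runs, would let the pool create up to that many threads on demand without its usual injection delay. Threads blocked on downloads still consume memory, so this only masks the underlying blocking.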

The proper solution would be to make the IO-bound methods asynchronous and schedule them independently of the CPU-bound methods. To help with scheduling, you can use TPL Dataflow (EnsureOrdered requires a prerelease version):

var cpuBoundOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = Environment.ProcessorCount,
    EnsureOrdered = false
};

var ioBoundOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 10, // TODO: tweak this value as necessary
    EnsureOrdered = false
};

var downloadBlock = new TransformBlock<string, string>(async file =>
{
    await DownloadContentAsync(file);
    return file;
}, ioBoundOptions);

var parseBlock = new TransformBlock<string, string>(file =>
{
    XMLParseAndQueryData(file);
    return file;
}, cpuBoundOptions);

var uploadBlock = new TransformBlock<string, string>(async file =>
{
    await UploadResultToQueue(file);
    return file;
}, ioBoundOptions);

var deleteBlock = new ActionBlock<string>(file => DeleteLocalCopy(file));

var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };

downloadBlock.LinkTo(parseBlock, linkOptions);
parseBlock.LinkTo(uploadBlock, linkOptions);
uploadBlock.LinkTo(deleteBlock, linkOptions);

foreach (var file in listOfFeatureFiles)
{
    downloadBlock.Post(file);
}

downloadBlock.Complete();
await deleteBlock.Completion;
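One possible variation, given that the question mentions billions of files: with default options each block's input queue is unbounded, so posting every file name up front keeps them all queued in memory. The sketch below sets BoundedCapacity on the blocks and feeds them with SendAsync, which waits while the first block is full. The capacity of 100, the shortened two-stage pipeline, and the Task.Delay placeholder standing in for DownloadContentAsync are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BoundedPipelineSketch
{
    public static async Task<int> RunAsync(IEnumerable<string> files)
    {
        var processed = new ConcurrentBag<string>();

        var ioBoundOptions = new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 10,
            BoundedCapacity = 100, // assumed value; limits queued and in-flight items
            EnsureOrdered = false
        };
        var cpuBoundOptions = new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = Environment.ProcessorCount,
            BoundedCapacity = 100, // assumed value
            EnsureOrdered = false
        };

        var downloadBlock = new TransformBlock<string, string>(async file =>
        {
            await Task.Delay(5); // placeholder for the real asynchronous download
            return file;
        }, ioBoundOptions);

        // Placeholder for the CPU-bound parse step; it only records the file name.
        var parseBlock = new ActionBlock<string>(file => processed.Add(file), cpuBoundOptions);

        downloadBlock.LinkTo(parseBlock, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var file in files)
        {
            // Unlike Post, SendAsync waits asynchronously while the block is full.
            await downloadBlock.SendAsync(file);
        }

        downloadBlock.Complete();
        await parseBlock.Completion;
        return processed.Count;
    }
}
```

Because SendAsync applies backpressure, the file list can be enumerated lazily (for example from a paged blob listing) instead of being held in memory as a whole.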
