
How to parallelize a list using multiple ActionBlocks?

I have a large folder structure that I'm trying to download from a shared drive. The shared drive is slow, but it also has several mirrors. To speed up the process I'm trying to make a little downloader app that manages parallel connections to all of the slow mirrors. Individual files would get downloaded from different mirrors. I'd also like to be able to limit the number of threads connecting to each mirror at one time. (Does this already exist? Then I don't have to write any code. I did look though.)

This seems like it might be a Dataflow use case, though I'm very new to Dataflow so I'm not positive. I started with something like this:

var buffer = new BufferBlock<string>();
var blockOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = threadsPerPath
};

IEnumerable<ActionBlock<string>> blocks = mirrors.Select(basePath =>
{
    return new ActionBlock<string>(
        file => {
            string destinationFile = Path.Combine(destination, file);
            Directory.CreateDirectory(Path.GetDirectoryName(destinationFile));
            File.Copy(Path.Combine(basePath, file), destinationFile);
        },
        blockOptions);
});

foreach (ActionBlock<string> block in blocks)
{
    buffer.LinkTo(block);
}

await Task.Run(() =>
{
    string top = mirrors[0];

    int baseLength = top.Length;

    IEnumerable<string> allFiles = Directory.EnumerateFiles(top, "*", SearchOption.AllDirectories);

    foreach (string path in allFiles)
    {
        buffer.Post(path[baseLength..]);
    }

    buffer.Complete();
});

(I plan on playing around with threadsPerPath. Not sure whether I will see gains from parallelizing access to the same mirror.) When run, this only uses the first mirror - as far as I can tell the ActionBlocks for the other mirrors never get data. I gather this is by design, but I'm not sure how else to do this. How can I get several ActionBlocks to process the same buffer in parallel, where each item in the buffer only goes to one of the ActionBlocks?

One way to limit the total parallelism across multiple ActionBlock<T>s is to configure all of them with the ConcurrentScheduler of the same ConcurrentExclusiveSchedulerPair instance:

var sharedScheduler = new ConcurrentExclusiveSchedulerPair(
    TaskScheduler.Default, maxConcurrencyLevel: 50).ConcurrentScheduler;

var blockOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = threadsPerPath,
    TaskScheduler = sharedScheduler
};
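
As a rough check of the shared-scheduler idea, the sketch below runs several blocks with a synchronous action on one ConcurrentScheduler and records the highest number of actions running at once. The class name SharedSchedulerDemo, the method RunAsync, and the numbers are hypothetical and chosen only for illustration; Thread.Sleep stands in for the file copy.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class SharedSchedulerDemo
{
    // Returns the peak number of actions observed running at the same time.
    public static async Task<int> RunAsync(int blockCount, int itemsPerBlock, int globalLimit)
    {
        // One scheduler shared by every block caps their combined concurrency.
        var sharedScheduler = new ConcurrentExclusiveSchedulerPair(
            TaskScheduler.Default, maxConcurrencyLevel: globalLimit).ConcurrentScheduler;

        var options = new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 4,   // per-block limit (assumed value)
            TaskScheduler = sharedScheduler
        };

        int running = 0, peak = 0;
        var gate = new object();

        var blocks = Enumerable.Range(0, blockCount)
            .Select(_ => new ActionBlock<int>(item =>
            {
                lock (gate) { running++; peak = Math.Max(peak, running); }
                Thread.Sleep(10);          // synchronous stand-in for File.Copy
                lock (gate) { running--; }
            }, options))
            .ToList();                     // materialize so the blocks are created once

        foreach (var block in blocks)
        {
            for (int i = 0; i < itemsPerBlock; i++) block.Post(i);
            block.Complete();
        }

        await Task.WhenAll(blocks.Select(b => b.Completion));
        return peak;
    }
}
```

Calling `await SharedSchedulerDemo.RunAsync(3, 10, 2)` should return a peak no larger than 2, even though the three blocks together would otherwise allow up to 12 concurrent actions.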

This works only if the action of the ActionBlock<T> is synchronous, as it is in the code that you've posted. The TaskScheduler option cannot limit the concurrency of asynchronous work, because the scheduler only controls the synchronous segments between awaits and is released while an await is pending. In that case you would have to use a shared SemaphoreSlim inside the action.
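
For the asynchronous case, a minimal sketch of the shared SemaphoreSlim approach could look like the following. The names SharedThrottleDemo and RunAsync and all numeric values are hypothetical; Task.Delay stands in for an asynchronous download.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class SharedThrottleDemo
{
    // Returns the peak number of async actions observed inside the throttle.
    public static async Task<int> RunAsync(int blockCount, int itemsPerBlock, int globalLimit)
    {
        // One semaphore shared by all blocks enforces the global limit.
        using var throttle = new SemaphoreSlim(globalLimit);

        var options = new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 4    // per-block limit (assumed value)
        };

        int running = 0, peak = 0;
        var gate = new object();

        var blocks = Enumerable.Range(0, blockCount)
            .Select(_ => new ActionBlock<int>(async item =>
            {
                await throttle.WaitAsync();
                try
                {
                    lock (gate) { running++; peak = Math.Max(peak, running); }
                    await Task.Delay(10); // asynchronous stand-in for a download
                }
                finally
                {
                    lock (gate) { running--; }
                    throttle.Release();
                }
            }, options))
            .ToList();

        foreach (var block in blocks)
        {
            for (int i = 0; i < itemsPerBlock; i++) block.Post(i);
            block.Complete();
        }

        await Task.WhenAll(blocks.Select(b => b.Completion));
        return peak;
    }
}
```

Here the per-block MaxDegreeOfParallelism still applies, and the semaphore adds a cap on top of it that holds across all blocks.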

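To implement load balancing you need to limit the input capacity of the linked blocks. If you want to spread requests across all blocks you'll have to set their BoundedCapacity, perhaps even to 1. Note that BoundedCapacity also counts the items currently being processed, so if you want each block to actually run threadsPerPath items in parallel, set it to at least the same number as MaxDegreeOfParallelism: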

var blockOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = threadsPerPath,
    BoundedCapacity = bound
};

The problem with the original code is that all ActionBlocks have unbounded input buffers, so all messages (files) end up in the first block.

A block offers each output message to its linked blocks in the order they were linked, and the first one that accepts gets it. A block with an unbounded buffer always accepts, so every message goes to that block. By limiting every block's input buffer to just 1 item, a busy block declines the message and it goes to the next block that has room, which spreads the files across the mirrors.

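Once the messages are posted you can await all blocks to complete. For this to work, blocks has to be materialized once (for example with ToList(), because a lazy Select creates new blocks every time it is enumerated), and the links have to be created with PropagateCompletion = true so that buffer.Complete() reaches the ActionBlocks:

await Task.WhenAll(blocks.Select(b => b.Completion));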
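
The difference between unbounded and bounded blocks can be checked with a small sketch like the one below, which counts how many items each block handles. LoadBalancingDemo and RunAsync are hypothetical names, Thread.Sleep stands in for the file copy, and the links use PropagateCompletion so that the final await can finish.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class LoadBalancingDemo
{
    // Returns how many items each block processed, in link order.
    public static async Task<int[]> RunAsync(int blockCount, int itemCount, int boundedCapacity)
    {
        var counts = new int[blockCount];
        var buffer = new BufferBlock<int>();
        var options = new ExecutionDataflowBlockOptions
        {
            BoundedCapacity = boundedCapacity
        };

        // ToList() materializes the blocks so the same instances are linked and awaited.
        var blocks = Enumerable.Range(0, blockCount)
            .Select(index => new ActionBlock<int>(item =>
            {
                Thread.Sleep(20);                        // stand-in for File.Copy
                Interlocked.Increment(ref counts[index]);
            }, options))
            .ToList();

        // PropagateCompletion lets buffer.Complete() flow to every linked block.
        var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
        foreach (var block in blocks)
        {
            buffer.LinkTo(block, linkOptions);
        }

        for (int i = 0; i < itemCount; i++) buffer.Post(i);
        buffer.Complete();

        await Task.WhenAll(blocks.Select(b => b.Completion));
        return counts;
    }
}
```

With `DataflowBlockOptions.Unbounded` as the capacity, the first linked block should take every item; with a capacity of 1, the items should be spread over all blocks, although the exact split depends on timing.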


 