I have a data flow as follows:
1. A task reads a text file and posts its lines to a BatchBlock<string> of size chunkSize, which groups them into chunks.
2. An ActionBlock, linked to the BatchBlock above, partitions each chunk into batches and adds them to a BufferBlock.
3. A TransformBlock, linked to the BufferBlock, spawns an async task for each batch.
4. The process is finished when all the spawned async calls are finished.
The code below isn't working as expected. It finishes before all batches are processed. What am I missing?
private static void DataFlow(string filePath, int chunkSize, int batchSize)
{
    int chunkCount = 0;
    int batchCount = 0;
    BatchBlock<string> chunkBlock = new BatchBlock<string>(chunkSize);
    BufferBlock<IEnumerable<string>> batchBlock = new BufferBlock<IEnumerable<string>>();
    Task produceTask = Task.Factory.StartNew(() =>
    {
        foreach (var line in File.ReadLines(filePath))
        {
            chunkBlock.Post(line);
        }
        Console.WriteLine("Finished producing");
        chunkBlock.Complete();
    });
    var makeBatches = new ActionBlock<string[]>(t =>
    {
        Console.WriteLine("Got a chunk " + ++chunkCount);
        // Partition each chunk into smaller chunks grouped on column 1
        var partitions = t.GroupBy(c => c.Split(',')[0], (key, g) => g);
        // Further break down the chunks into batch size groups
        var groups = partitions.Select(x => x.Select((i, index) => new { i, index }).GroupBy(g => g.index / batchSize, e => e.i));
        // Get batches from groups
        var batches = groups.SelectMany(x => x).Select(y => y.Select(z => z));
        foreach (var batch in batches)
        {
            batchBlock.Post(batch);
        }
        batchBlock.Complete();
    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });
    chunkBlock.LinkTo(makeBatches, new DataflowLinkOptions { PropagateCompletion = true });
    var executeBatches = new TransformBlock<IEnumerable<string>, IEnumerable<string>>(async b =>
    {
        Console.WriteLine("Got a batch " + ++batchCount);
        await ExecuteBatch(b);
        return b;
    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded });
    batchBlock.LinkTo(executeBatches, new DataflowLinkOptions { PropagateCompletion = true });
    var finishBatches = new ActionBlock<IEnumerable<string>>(b =>
    {
        Console.WriteLine("Finished executing batch " + batchCount);
    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded });
    executeBatches.LinkTo(finishBatches, new DataflowLinkOptions { PropagateCompletion = true });
    Task.WaitAll(produceTask);
    Console.WriteLine("Production complete");
    makeBatches.Completion.Wait();
    Console.WriteLine("Making batches complete");
    executeBatches.Completion.Wait();
    Console.WriteLine("Executing batches complete");
    Task.WaitAll(finishBatches.Completion);
    Console.WriteLine("Process complete with total chunks " + chunkCount + " and total batches " + batchCount);
    Console.ReadLine();
}

// async task to simulate network I/O
private static async Task ExecuteBatch(IEnumerable<string> batch)
{
    Console.WriteLine("Executing batch ");
    await Task.Run(() => System.Threading.Thread.Sleep(2000));
}
chunkBlock invokes makeBatches once for each chunk, and you call batchBlock.Complete() inside makeBatches, so batchBlock stops accepting new posts after the first chunk has been handled.
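The effect described above can be seen in isolation. The following is only a minimal sketch, assuming the System.Threading.Tasks.Dataflow library is referenced; the class and method names (CompleteDemo, PostAroundComplete) are made up for illustration and are not part of the question's code. It shows that Post reports false once Complete() has been called on a block:

```csharp
using System;
using System.Threading.Tasks.Dataflow;

// Hypothetical demo class, not part of the original question.
public static class CompleteDemo
{
    // Posts one item before and one item after Complete(),
    // and returns whether each Post was accepted.
    public static (bool before, bool after) PostAroundComplete()
    {
        var buffer = new BufferBlock<string>();

        bool before = buffer.Post("first");  // block is still open
        buffer.Complete();                   // block declines further input from here on
        bool after = buffer.Post("second");  // declined, item is dropped

        return (before, after);
    }

    public static void Main()
    {
        var (before, after) = PostAroundComplete();
        Console.WriteLine($"Accepted before Complete(): {before}");
        Console.WriteLine($"Accepted after Complete(): {after}");
    }
}
```

Because the question's code ignores the return value of Post, the batches from every later chunk are silently discarded rather than causing an error.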
You're mixing completion propagation with posting directly between blocks. Your makeBatches and executeBatches blocks aren't linked to each other, so completion cannot propagate between them. But the real issue is in these lines:
foreach (var batch in batches)
{
    batchBlock.Post(batch);
}
// this line stops the batchBlock from accepting any new messages
batchBlock.Complete();
You complete the batchBlock after the first chunk has been split into batches, and after that line it will not accept any other messages. As @DaxFohl said, you need to change your makeBatches from ActionBlock<string[]> to TransformManyBlock<string[], IEnumerable<string>> (so that it emits many batches for each chunk), and then link it to the next block:
var makeBatches = new TransformManyBlock<string[], IEnumerable<string>>(t =>
{
    Console.WriteLine("Got a chunk " + ++chunkCount);
    // Partition each chunk into smaller chunks grouped on column 1
    var partitions = t.GroupBy(c => c.Split(',')[0], (key, g) => g);
    // Further break down the chunks into batch size groups
    var groups = partitions.Select(x => x.Select((i, index) => new { i, index }).GroupBy(g => g.index / batchSize, e => e.i));
    // Get batches from groups
    return groups.SelectMany(x => x).Select(y => y.Select(z => z));
});
makeBatches.LinkTo(executeBatches, new DataflowLinkOptions { PropagateCompletion = true });
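Putting the pieces together, an end-to-end version might look roughly like the sketch below. It is an illustration under a few assumptions rather than a drop-in replacement: it takes the lines from an in-memory sequence instead of File.ReadLines, it replaces ExecuteBatch with a hypothetical ExecuteBatchAsync stub that only delays, it materializes each batch with ToList(), and it uses Interlocked for the batch counter because executeBatches runs with unbounded parallelism. The names PipelineSketch and RunAsync are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical sketch class, not part of the original question.
public static class PipelineSketch
{
    // Stand-in for the network call in the question (assumption: it only needs to be awaited).
    private static Task ExecuteBatchAsync(IReadOnlyList<string> batch) => Task.Delay(50);

    public static async Task<(int chunks, int batches)> RunAsync(
        IEnumerable<string> lines, int chunkSize, int batchSize)
    {
        int chunkCount = 0;
        int batchCount = 0;
        var link = new DataflowLinkOptions { PropagateCompletion = true };

        // Groups incoming lines into arrays of up to chunkSize items.
        var chunkBlock = new BatchBlock<string>(chunkSize);

        // Emits zero or more batches per chunk; no manual Post or Complete calls needed.
        var makeBatches = new TransformManyBlock<string[], IReadOnlyList<string>>(chunk =>
        {
            chunkCount++; // safe: this block runs with the default parallelism of 1
            return chunk
                .GroupBy(line => line.Split(',')[0])
                .SelectMany(group => group
                    .Select((line, index) => new { line, index })
                    .GroupBy(x => x.index / batchSize, x => x.line))
                .Select(batch => (IReadOnlyList<string>)batch.ToList())
                .ToList();
        });

        var executeBatches = new TransformBlock<IReadOnlyList<string>, IReadOnlyList<string>>(
            async batch =>
            {
                Interlocked.Increment(ref batchCount); // batches run concurrently
                await ExecuteBatchAsync(batch);
                return batch;
            },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded });

        var finishBatches = new ActionBlock<IReadOnlyList<string>>(
            batch => Console.WriteLine("Finished a batch of " + batch.Count + " line(s)"));

        // Every stage is linked, so completion flows from the first block to the last.
        chunkBlock.LinkTo(makeBatches, link);
        makeBatches.LinkTo(executeBatches, link);
        executeBatches.LinkTo(finishBatches, link);

        foreach (var line in lines)
        {
            chunkBlock.Post(line);
        }
        chunkBlock.Complete(); // the only explicit Complete() call in the pipeline

        // Waiting on the last block is enough once completion propagates through the links.
        await finishBatches.Completion;
        return (chunkCount, batchCount);
    }

    public static async Task Main()
    {
        var lines = new[] { "a,1", "a,2", "a,3", "b,1", "b,2" };
        var (chunks, batches) = await RunAsync(lines, chunkSize: 3, batchSize: 2);
        Console.WriteLine("Chunks: " + chunks + ", batches: " + batches);
    }
}
```

With the sample input, the first chunk holds the three "a" lines and is split into two batches, and the second chunk holds the two "b" lines and forms one batch, so the run ends with 2 chunks and 3 batches. Only chunkBlock is completed by hand; every other block completes through PropagateCompletion.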
Some other thoughts: you do not need to specify MaxDegreeOfParallelism = 1, as this is the default value, and you could perhaps use a BatchBlock for chunking your strings. A little bit late, but still better than never :)