
TPL Dataflow: ActionBlock that avoids repeatedly running a using-block (such as for writing to a StreamWriter) on every invocation of its delegate

I need to read 1M rows from an IDataReader and write n text files simultaneously. Each of those files will be a different subset of the available columns; all n text files will be 1M lines long when complete.

The current plan is one TransformManyBlock that iterates the IDataReader, linked to a BroadcastBlock, linked to n BufferBlock/ActionBlock pairs.
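For illustration, a rough sketch of that wiring, where Record, ReadRecords, WriteSubset and outputFiles are hypothetical placeholders:

var readerBlock = new TransformManyBlock<IDataReader, Record>(
    reader => ReadRecords(reader));                 // hypothetical: yields one Record per row

var broadcast = new BroadcastBlock<Record>(r => r); // hands every record to each subscriber

var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
readerBlock.LinkTo(broadcast, linkOptions);

foreach (var file in outputFiles)                   // one BufferBlock/ActionBlock pair per output file
{
    var buffer = new BufferBlock<Record>();
    var writer = new ActionBlock<Record>(record => WriteSubset(file, record));
    broadcast.LinkTo(buffer, linkOptions);
    buffer.LinkTo(writer, linkOptions);
}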

What I'm trying to avoid is having my ActionBlock delegate perform a using (StreamWriter x...) { x.WriteLine(); } that would open and close every output file a million times over.

My current thought is, in lieu of ActionBlock, to write a custom class that implements ITargetBlock<>. Is there a simpler approach?

EDIT 1: The discussion is of value for my current problem, but the answers so far got hyper-focused on file system behavior. For the benefit of future searchers, the thrust of the question was how to build some kind of setup/teardown outside the ActionBlock delegate. This would apply to any kind of disposable that you would ordinarily wrap in a using-block.

EDIT 2: Per @Panagiotis Kanavos, the executive summary of the solution is to set up the object before defining the block, then tear down the object in the block's Completion.ContinueWith.

Usually when working with TPL Dataflow I create a custom class, so that I can have private member variables and private methods for the blocks used in my pipeline. But rather than implementing ITargetBlock or ISourceBlock, I just hold whatever blocks I need inside my custom class, and then expose the ITargetBlock and/or ISourceBlock as public properties so that other classes can use the source and target blocks to link things together.
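For illustration, a minimal sketch of that wrapper pattern, reusing the hypothetical Record type from elsewhere on this page (the class and member names are made up):

class FileExportStage
{
    private readonly StreamWriter _writer;
    private readonly ActionBlock<Record> _block;

    public FileExportStage(string path)
    {
        _writer = new StreamWriter(path, true);
        _block = new ActionBlock<Record>(record =>
            _writer.WriteLine("{0} = {1} :{2}", record.Prop1, record.Prop5, record.Prop2));

        // Tear down the private writer once the private block completes
        _block.Completion.ContinueWith(_ => _writer.Dispose());
    }

    // Expose only the block interface, so callers can LinkTo this stage
    public ITargetBlock<Record> Target => _block;
}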

Writing to a file one line at a time is expensive in itself, even when you don't have to open the stream each time. Keeping a file stream open has other issues too: for performance reasons, file streams are always buffered, from the FileStream level all the way down to the file system driver. You'd have to flush the stream periodically to ensure the data was written to disk.

To really improve performance you'd have to batch the records, e.g. with a BatchBlock. Once you do that, the cost of opening the stream becomes negligible.

The lines should also be generated at the last possible moment, to avoid creating temporary strings that will need to be garbage-collected. At n*1M records, the memory and CPU overhead of those allocations and garbage collections would be severe.

Logging libraries batch log entries before writing, to avoid this performance hit.

You can try something like this:

var batchBlock = new BatchBlock<Record>(1000);
var writerBlock = new ActionBlock<Record[]>(records => {

    // Create or open the file for appending
    using var writer = new StreamWriter(ThePath, true);
    foreach (var record in records)
    {
        writer.WriteLine("{0} = {1} :{2}", record.Prop1, record.Prop5, record.Prop2);
    }
});

// PropagateCompletion lets Complete() flow from the batch block to the writer
var options = new DataflowLinkOptions { PropagateCompletion = true };
batchBlock.LinkTo(writerBlock, options);

Or, using asynchronous methods:

var batchBlock = new BatchBlock<Record>(1000);
var writerBlock = new ActionBlock<Record[]>(async records => {

    // Create or open the file for appending
    await using var writer = new StreamWriter(ThePath, true);
    foreach (var record in records)
    {
        // WriteLineAsync has no format-string overload, so compose the line first
        await writer.WriteLineAsync($"{record.Prop1} = {record.Prop5} :{record.Prop2}");
    }
});

batchBlock.LinkTo(writerBlock, options);

You can adjust the batch size and the StreamWriter's buffer size for optimum performance.
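For instance, a sketch with hypothetical tuning values: a 5,000-record batch and a 64 KB writer buffer (StreamWriter's default buffer is only about 1 KB):

var batchBlock = new BatchBlock<Record>(5000);
var writerBlock = new ActionBlock<Record[]>(records => {
    // A larger buffer means fewer trips through the file system layers
    using var writer = new StreamWriter(ThePath, true, Encoding.UTF8, 65536);
    foreach (var record in records)
    {
        writer.WriteLine("{0} = {1} :{2}", record.Prop1, record.Prop5, record.Prop2);
    }
});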

Creating an actual "Block" that writes to a stream

A custom block can be created using the technique shown in the Custom Dataflow block walkthrough. Instead of creating an actual custom block though, you can create something that returns whatever LinkTo needs to work, in this case an ITargetBlock<T>:

ITargetBlock<Record> FileExporter(string path)
{
    var writer = new StreamWriter(path, true);
    var block = new ActionBlock<Record>(async msg => {
        // WriteLineAsync has no format-string overload, so compose the line first
        await writer.WriteLineAsync($"{msg.Prop1} = {msg.Prop5} :{msg.Prop2}");
    });

    // Close the stream when the block completes
    block.Completion.ContinueWith(_ => writer.Close());
    return block;
}
...


var exporter1 = FileExporter(path1);
previous.LinkTo(exporter1, options);

The "trick" here is that the stream is created outside the block and remains active until the block completes.这里的“技巧”是在块外创建流并保持活动状态直到块完成。 It's not garbage-collected because it's used by other code.它不会被垃圾收集,因为它被其他代码使用。 When the block completes, we need to explicitly close it, no matter what happened.当块完成时,无论发生什么,我们都需要显式关闭它。 block.Completion.ContinueWith(_=>write.Close()); will close the stream whether the block completed gracefully or not.无论块是否正常完成,都将关闭流。

This is the same code used in the walkthrough to close the output BufferBlock:

target.Completion.ContinueWith(delegate
{
   if (queue.Count > 0 && queue.Count < windowSize)
      source.Post(queue.ToArray());
   source.Complete();
});

Streams are buffered by default, so calling WriteLine doesn't mean the data will actually be written to disk. This means we don't know when the data will actually be written to the file; if the application crashes, some data may be lost.
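If durability matters, one mitigation (a sketch, reusing the long-lived writer and hypothetical Record fields from the FileExporter example above, with batched input) is to flush explicitly after each batch. Note that FlushAsync pushes the data to the operating system; forcing it all the way to disk would take FileStream.Flush(true):

var block = new ActionBlock<Record[]>(async records => {
    foreach (var msg in records)
        await writer.WriteLineAsync($"{msg.Prop1} = {msg.Prop5} :{msg.Prop2}");

    // Push the writer's buffered data down to the OS after every batch
    await writer.FlushAsync();
});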

Memory, IO and overheads

When working with 1M rows over a significant period of time, things add up. One could use e.g. File.AppendAllLinesAsync to write batches of lines at once, but that would result in the allocation of 1M temporary strings. At each iteration, the runtime would need at least as much RAM for those temporary strings as for the batch itself. RAM usage would start ballooning to hundreds of MBs, then GBs, before the GC fired, freezing the threads.

With 1M rows and lots of data it's hard to debug and track data in the pipeline. If something goes wrong, things can crash very quickly. Imagine, for example, 1M messages stuck in one block because one message got blocked.

It's important (for sanity and performance reasons) to keep the individual components in the pipeline as simple as possible.
