Exception using Rx and await to read a file line by line asynchronously

I am learning to use Rx and tried this sample, but I could not fix the exception that occurs at the highlighted while statement, while (!f.EndOfStream).

I want to read a huge file line by line and, for every line of data, do some processing on a different thread (so I used ObserveOn). I want the whole thing to be async. I want to use ReadLineAsync since it returns a Task, which I can convert to an observable and subscribe to.

I guess the task thread that I create first gets in between the Rx threads. But even if I observe and subscribe on the current thread, I still cannot stop the exception. I wonder how to accomplish this neatly and asynchronously with Rx.

I am also wondering whether the whole thing could be done more simply.

    static void Main(string[] args)
    {
        RxWrapper.ReadFileWithRxAsync();
        Console.WriteLine("this should be called even before the file read begins");
        Console.ReadLine();
    }

    public static async Task ReadFileWithRxAsync()
    {
        Task t = Task.Run(() => ReadFileWithRx());
        await t;
    }


    public static void ReadFileWithRx()
    {
        string file = @"C:\FileWithLongListOfNames.txt";
        using (StreamReader f = File.OpenText(file))
        {
            string line = string.Empty;
            bool continueRead = true;

            while (!f.EndOfStream)   // <-- the exception is thrown here
            {
                f.ReadLineAsync()
                       .ToObservable()
                       .ObserveOn(Scheduler.Default)
                       .Subscribe(t =>
                           {
                               Console.WriteLine("custom code to manipulate every line data");
                           });
            }

        }
    }

The exception is an InvalidOperationException. I'm not intimately familiar with the internals of FileStream, but according to the exception message it is thrown because there is an in-flight asynchronous operation on the stream. The implication is that you must wait for any ReadLineAsync() call to finish before checking EndOfStream.
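As a minimal sketch of that implication (not a drop-in replacement for the question's code): awaiting each ReadLineAsync() before asking for the next line means no read is still in flight when the loop condition runs, and the null that ReadLineAsync() returns at end of stream makes the EndOfStream check unnecessary. The class name LineReader and the handleLine delegate are hypothetical names introduced only for illustration.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class LineReader
{
    // Hypothetical helper: reads a file one line at a time and returns the
    // number of lines handled. Each read completes before the next begins.
    public static async Task<int> ReadLinesSequentiallyAsync(
        string path, Action<string> handleLine)
    {
        int count = 0;
        using (StreamReader reader = File.OpenText(path))
        {
            string line;
            // ReadLineAsync yields null at end of stream, so EndOfStream is
            // never consulted while an asynchronous read is pending.
            while ((line = await reader.ReadLineAsync()) != null)
            {
                handleLine(line);
                count++;
            }
        }
        return count;
    }
}
```

Note that this processes lines strictly one after another; it removes the exception but does not by itself add any parallelism.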

Matthew Finlay has provided a neat re-working of your code to solve this immediate problem. However, I think it has problems of its own, and there is a bigger issue that needs to be examined. Let's look at the fundamental elements of the problem:

  • You have a very large file.
  • You want to process it asynchronously.

This suggests that you don't want the whole file in memory, you want to be informed when the processing is done, and presumably you want to process the file as fast as possible.

Both solutions use a thread to process each line (the ObserveOn passes each line to a thread from the thread pool). This is actually not an efficient approach.

Looking at both solutions, there are two possibilities:

  • A. It takes more time on average to read a file line than it does to process it.
  • B. It takes less time on average to read a file line than it does to process it.

A. Reading a line is slower than processing a line

In the case of A, the system will spend most of its time idle while it waits for file IO to complete. In this scenario, Matthew's solution won't result in memory filling up, but it's worth checking whether using ReadLines directly in a tight loop produces better results due to less thread contention. (ObserveOn pushing the line to another thread will only buy you something if ReadLines isn't fetching lines in advance of the call to MoveNext, which I suspect it does, but test and see!)

B. Reading a line is faster than processing a line

In the case of B (which I assume is more likely given what you have tried), all those lines will start to queue up in memory and, for a big enough file, you will end up with most of it in memory.

You should note that unless your handler fires off asynchronous code to process a line, all lines will be processed serially, because Rx guarantees that OnNext() handler invocations won't overlap.

The ReadLines() method is great because it returns an IEnumerable<string>, and it is your enumeration of this that drives the reading of the file. However, when you call ToObservable() on it, it will enumerate as fast as possible to generate the observable events. There is no feedback (known as "backpressure" in reactive programs) in Rx to slow this process down.

The problem is not the ToObservable itself, it's the ObserveOn. ObserveOn doesn't block the OnNext() call it receives until its subscribers are done with the event; it queues up events as fast as possible against the target scheduler.

If you remove the ObserveOn, then, as long as your OnNext handler is synchronous, you'll see each line read and processed one at a time, because ToObservable() processes the enumeration on the same thread as the handler.

If this isn't what you want, and you attempt to get parallel processing by firing an async job in the subscriber, e.g. Task.Run(() => /* process line */) or similar, then things won't go as well as you hope.

Because it takes longer to process a line than to read one, you will create more and more tasks that can't keep pace with the incoming lines. The thread count will gradually increase and you will starve the thread pool.

In this case, Rx isn't really a great fit.

What you probably want is a small number of worker threads (probably one per processor core) that each fetch one line at a time to work on, which limits the number of lines of the file held in memory.

A simple approach could be the following, which limits the number of lines in memory to a fixed number of workers. It's a pull-based solution, which is a much better design in this scenario:

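// Requires: using System; using System.IO; using System.Threading.Tasks;

// Processes the file on a background task with at most numberOfWorkers parallel workers.
private static Task ProcessFile(string filePath, int numberOfWorkers)
{
    // ReadLines is lazy: lines are read as the loop below asks for them.
    var lines = File.ReadLines(filePath);

    var parallelOptions = new ParallelOptions {
        MaxDegreeOfParallelism = numberOfWorkers
    };

    return Task.Run(() =>
        Parallel.ForEach(lines, parallelOptions, ProcessFileLine));
}

// Static so that it can be called from the static Main below.
private static void ProcessFileLine(string line)
{
    /* Your processing logic here */
    Console.WriteLine(line);
}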

And use it like this:

static void Main()
{       
    var processFile = ProcessFile(
        @"C:\Users\james.world\Downloads\example.txt", 8);

    Console.WriteLine("Processing file...");        
    processFile.Wait();
    Console.WriteLine("Done");
}

Final Notes

There are ways of dealing with backpressure in Rx (search around SO for some discussions), but it's not something that Rx handles well, and I think the resulting solutions are less readable than the alternative above. There are also many other approaches that you can look at (actor-based approaches such as TPL Dataflow, or LMAX Disruptor style ring buffers for high-performance lock-free approaches), but the core idea of pulling work from queues will be prevalent.
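To make the "pulling work from queues" idea concrete, here is one possible sketch using a bounded BlockingCollection<string> from the standard library. It is an illustration under stated assumptions rather than a recommendation over the Parallel.ForEach version: BoundedLineProcessor, maxBufferedLines and processLine are hypothetical names, and the sketch assumes processLine does not throw (a faulting worker would need extra handling so the reader is not left blocked on a full queue).

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class BoundedLineProcessor
{
    // Hypothetical helper: one reader pushes lines into a bounded queue while
    // numberOfWorkers workers pull from it. At most maxBufferedLines lines wait
    // in the queue at any moment, which acts as a simple form of backpressure.
    public static Task ProcessFileAsync(
        string filePath, int numberOfWorkers, int maxBufferedLines,
        Action<string> processLine)
    {
        return Task.Run(() =>
        {
            using (var queue = new BlockingCollection<string>(maxBufferedLines))
            {
                // Each worker pulls lines until the queue is completed and drained.
                Task[] workers = Enumerable.Range(0, numberOfWorkers)
                    .Select(_ => Task.Run(() =>
                    {
                        foreach (string line in queue.GetConsumingEnumerable())
                        {
                            processLine(line);
                        }
                    }))
                    .ToArray();

                try
                {
                    foreach (string line in File.ReadLines(filePath))
                    {
                        // Add blocks while the queue is full, so the reader can
                        // never run far ahead of the workers.
                        queue.Add(line);
                    }
                }
                finally
                {
                    // Signals the workers that no more lines will arrive.
                    queue.CompleteAdding();
                }

                Task.WaitAll(workers);
            }
        });
    }
}
```

A call such as BoundedLineProcessor.ProcessFileAsync(path, Environment.ProcessorCount, 1000, ProcessFileLine) would then return a task that completes once every line has been handled; the buffer size of 1000 is an arbitrary example value that would need tuning.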

Even in this analysis, I am conveniently glossing over what you are doing to process the file, and tacitly assuming that the processing of each line is compute bound and truly independent. If there is work to merge the results and/or IO activity to store the output, then all bets are off, and you will need to examine the efficiency of that side of things carefully too.

In most cases where performing work in parallel is being considered as an optimization, there are so many variables in play that it is best to measure the results of each approach to determine which is best. And measuring is a fine art: be sure to measure realistic scenarios, take averages over many runs of each test, and properly reset the environment between runs (e.g. to eliminate caching effects) in order to reduce measurement error.
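A rough sketch of such a measurement loop, using System.Diagnostics.Stopwatch, might look like the following. The Timing class and the resetEnvironment delegate are hypothetical; what "resetting the environment" actually involves (for example, defeating the operating system's file cache) depends on your setup and is deliberately left to the caller.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class Timing
{
    // Hypothetical helper: runs `action` `runs` times, calling `resetEnvironment`
    // before each timed run, and returns the mean elapsed time in milliseconds.
    public static double AverageMilliseconds(
        Action action, Action resetEnvironment, int runs)
    {
        if (runs <= 0) throw new ArgumentOutOfRangeException(nameof(runs));

        var samples = new double[runs];
        for (int i = 0; i < runs; i++)
        {
            resetEnvironment();              // not included in the timing
            Stopwatch stopwatch = Stopwatch.StartNew();
            action();
            stopwatch.Stop();
            samples[i] = stopwatch.Elapsed.TotalMilliseconds;
        }
        return samples.Average();
    }
}
```

Comparing, say, the Rx version against the Parallel.ForEach version would then be a matter of passing each as the action with the same reset step and the same input file.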

I haven't looked into what is causing your exception, but I think the neatest way to write this is:

File.ReadLines(file)
  .ToObservable()
  .ObserveOn(Scheduler.Default)
  .Subscribe(Console.WriteLine);

Note: ReadLines differs from ReadAllLines in that it starts yielding lines without having read the entire file, which is the behavior that you want.
