
Multicore Text File Parsing

I have a quad-core machine and would like to write some code to parse a text file in a way that takes advantage of all four cores. The text file basically contains one record per line.

Multithreading isn't my forte, so I'm wondering if anyone could give me some patterns I might be able to use to parse the file in an optimal manner.

My first thought is to read all the lines into some sort of queue and then spin up threads to pull lines off the queue and process them, but that means the whole queue would have to exist in memory, and these are fairly large files, so I'm not keen on that idea.

My next thought is to have some sort of controller that reads in a line and assigns it to a thread to parse, but I'm not sure whether the controller will end up being a bottleneck if the threads can process the lines faster than it can read and assign them.

I know there's probably another, simpler solution than both of these, but at the moment I'm just not seeing it.

Mark's answer is the simpler, more elegant solution. Why build a complex program with inter-thread communication if it's not necessary? Spawn 4 threads. Each thread calculates size-of-file/4 to determine its start point (and stop point). Each thread can then work entirely independently.

The only reason to add a special thread to handle reading is if you expect some lines to take a very long time to process and you expect those lines to be clustered in a single part of the file. Adding inter-thread communication when you don't need it is a very bad idea. You greatly increase the chance of introducing an unexpected bottleneck and/or synchronization bugs.

I'd go with your original idea. If you are concerned that the queue might get too large, implement a buffer zone for it (i.e. if it gets above 100 lines, stop reading the file, and if it drops below 20, start reading again; you'd need to do some testing to find the optimal thresholds). Make it so that any of the threads can potentially be the "reader thread": since a thread has to lock the queue to pull an item out anyway, it can also check whether the "low buffer" mark has been hit and start reading again. While it's doing this, the other threads can work through the rest of the queue.
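
The buffer-zone idea can be approximated with a bounded blocking queue: put() blocks the reader whenever the queue is at capacity, which collapses the two thresholds into a single high-water mark (the explicit 100/20 hysteresis is left out). A minimal Java sketch; the 100-slot capacity and the counter standing in for real record parsing are placeholder choices:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedQueueDemo {
    private static final String POISON = "\0EOF\0"; // sentinel marking end of input

    public static int process(List<String> lines, int workers) throws InterruptedException {
        // Bounded queue: the reader blocks while 100 lines are waiting,
        // so the whole file never sits in memory at once.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        AtomicInteger processed = new AtomicInteger();

        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> {
                try {
                    while (true) {
                        String line = queue.take();
                        if (POISON.equals(line)) {
                            queue.put(POISON);       // pass it on so sibling workers stop too
                            return;
                        }
                        processed.incrementAndGet(); // stand-in for real record parsing
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }

        for (String line : lines) queue.put(line);   // blocks whenever the queue is full
        queue.put(POISON);
        for (Thread t : threads) t.join();
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> lines = java.util.Collections.nCopies(1000, "record");
        System.out.println(process(lines, 4));       // prints 1000
    }
}
```

The poison-pill sentinel is one common way to tell workers the input is exhausted; each worker re-queues it so the others see it as well.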

Or, if you prefer, have one reader thread assign the lines to three other processor threads (via their own queues) and implement a work-stealing strategy. I've never done this, so I don't know how hard it is.

This will eliminate the bottleneck of having a single thread do the reading:

open file
for each thread n=0,1,2,3:
    seek to file offset n/4 * filesize
    scan to next complete line
    process all lines in your part of the file
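
Assuming newline-delimited records, the steps above could be fleshed out roughly like this in Java (line counting stands in for the unspecified per-record processing; the thread count and file layout are illustrative):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;

public class ChunkedParser {
    /** Processes the lines whose first byte lies in [start, end). */
    static long parseChunk(Path file, long start, long end) throws IOException {
        long count = 0;
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            if (start > 0) {
                raf.seek(start - 1);
                raf.readLine(); // scan to the first line that begins at or after `start`
            }
            while (raf.getFilePointer() < end && raf.readLine() != null) {
                count++;        // stand-in for real per-record processing
            }
        }
        return count;
    }

    public static long parse(Path file, int nThreads) throws Exception {
        long size = Files.size(file);
        AtomicLong total = new AtomicLong();
        Thread[] threads = new Thread[nThreads];
        for (int n = 0; n < nThreads; n++) {
            long start = n * size / nThreads;     // offset n/N * filesize
            long end = (n + 1) * size / nThreads;
            threads[n] = new Thread(() -> {
                try {
                    total.addAndGet(parseChunk(file, start, end));
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            threads[n].start();
        }
        for (Thread t : threads) t.join();
        return total.get();
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("records", ".txt");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) sb.append("record ").append(i).append('\n');
        Files.write(tmp, sb.toString().getBytes(StandardCharsets.US_ASCII));
        System.out.println(parse(tmp, 4));        // prints 1000
        Files.delete(tmp);
    }
}
```

Seeking to start-1 and discarding one readLine() result guarantees each line is owned by exactly one chunk, even when a chunk boundary lands exactly on a line start.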

My experience is with Java, not C#, so apologies if these solutions don't apply.

The immediate solution I can think up off the top of my head would be to have an executor that runs 3 threads (using Executors.newFixedThreadPool, say). For each line/record read from the input file, fire off a job at the executor (using ExecutorService.submit). The executor will queue requests for you and allocate them between the 3 threads.
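
A sketch of that approach, with the file-reading loop replaced by an in-memory list and the job body by a counter, since the real parsing logic isn't specified:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ExecutorDemo {
    public static int process(List<String> lines) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(3); // 3 worker threads
        AtomicInteger processed = new AtomicInteger();
        for (String line : lines) {
            // One submitted job per line/record; the executor queues and
            // distributes them across the 3 threads for you.
            pool.submit(() -> processed.incrementAndGet()); // stand-in for parsing `line`
        }
        pool.shutdown();                        // no more jobs; workers drain the queue
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(process(java.util.Collections.nCopies(500, "record"))); // prints 500
    }
}
```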

Probably better solutions exist, but hopefully that will do the job. :-)

ETA: Sounds a lot like Wolfbyte's second solution. :-)

ETA2: System.Threading.ThreadPool sounds like a very similar idea in .NET. I've never used it, but it may be worth your while!

Since the bottleneck will generally be in the processing rather than the reading when dealing with files, I'd go with the producer-consumer pattern. To avoid locking, I'd look at lock-free lists. Since you are using C#, you can take a look at Julian Bucknall's Lock-Free List code.
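
In Java, the closest standard analogue to a lock-free list is ConcurrentLinkedQueue, a CAS-based queue that takes no locks on either end. A hedged sketch of the producer-consumer pattern built on it (the spin-wait shutdown protocol and the counter standing in for parsing are illustrative choices; Thread.onSpinWait needs Java 9+):

```java
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class LockFreeDemo {
    public static int process(List<String> lines, int workers) throws InterruptedException {
        // Lock-free queue: producers and consumers synchronize via CAS, not locks.
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        AtomicBoolean doneReading = new AtomicBoolean(false);
        AtomicInteger processed = new AtomicInteger();

        Thread[] consumers = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            consumers[i] = new Thread(() -> {
                while (true) {
                    String line = queue.poll();       // non-blocking; null if empty
                    if (line != null) {
                        processed.incrementAndGet();  // stand-in for parsing
                    } else if (doneReading.get() && queue.isEmpty()) {
                        return;                       // producer finished and queue drained
                    } else {
                        Thread.onSpinWait();          // brief back-off, still no lock taken
                    }
                }
            });
            consumers[i].start();
        }

        for (String line : lines) queue.offer(line);  // producer never blocks
        doneReading.set(true);
        for (Thread t : consumers) t.join();
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(process(java.util.Collections.nCopies(1000, "record"), 4)); // prints 1000
    }
}
```

Note the trade-off: because poll() never blocks, idle consumers spin instead of sleeping, so this only pays off when the queue is rarely empty.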

@lomaxx

@Derek & Mark: I wish there was a way to accept 2 answers. I'm going to have to end up going with Wolfbyte's solution, because if I split the file into n sections there is the potential for a thread to come across a batch of "slow" transactions. However, if I were processing a file where each record was guaranteed to require an equal amount of processing, then I really like your solution of just splitting the file into chunks, assigning each chunk to a thread, and being done with it.

No worries. If clustered "slow" transactions are an issue, then the queuing solution is the way to go. Depending on how fast or slow the average transaction is, you might also want to look at assigning multiple lines at a time to each worker. This will cut down on synchronization overhead. Likewise, you might need to optimize your buffer size. Of course, both of these are optimizations that you should probably only do after profiling. (No point in worrying about synchronization if it's not a bottleneck.)
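
Batching can be grafted onto a blocking queue with drainTo: one blocking take per batch instead of one per line, which is exactly the synchronization saving described. A sketch under the same placeholder assumptions as the other examples (queue capacity, batch size, and the counter are arbitrary):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class BatchDemo {
    private static final String POISON = "\0EOF\0"; // sentinel marking end of input

    public static int process(List<String> lines, int workers, int batchSize)
            throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
        AtomicInteger processed = new AtomicInteger();

        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> {
                List<String> batch = new ArrayList<>(batchSize);
                try {
                    while (true) {
                        // One blocking take, then grab up to batchSize-1 more in a
                        // single drain: one synchronization round-trip per batch.
                        batch.add(queue.take());
                        queue.drainTo(batch, batchSize - 1);
                        boolean stop = batch.remove(POISON);
                        processed.addAndGet(batch.size()); // stand-in for parsing the batch
                        batch.clear();
                        if (stop) {
                            queue.put(POISON);  // wake sibling workers so they stop too
                            return;
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }

        for (String line : lines) queue.put(line);
        queue.put(POISON);
        for (Thread t : threads) t.join();
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(process(java.util.Collections.nCopies(1000, "record"), 4, 32)); // prints 1000
    }
}
```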

If the text you are parsing is made up of repeated strings and tokens, break the file into chunks, and for each chunk have one thread pre-parse it into tokens consisting of keywords, "punctuation", ID strings, and values. String compares and lookups can be quite expensive, and passing this off to several worker threads can speed up the purely logical/semantic part of the code if it doesn't have to do the string lookups and comparisons.

The pre-parsed chunks of data (where you have already done all the string comparisons and "tokenized" them) can then be passed to the part of the code that actually looks at the semantics and ordering of the tokenized data.

Also, you mention you are concerned about the size of your file occupying a large amount of memory. There are a couple of things you could do to cut back on your memory budget.

Split the file into chunks and parse it chunk by chunk. Read in only as many chunks as you are working on at a time, plus a few for "read ahead", so you do not stall on disk when you finish processing one chunk before moving on to the next.

Alternatively, large files can be memory-mapped and loaded on demand. If you have more threads working on processing the file than CPUs (usually threads = 1.5-2x the number of CPUs is a good ratio for demand-paging apps), the threads that stall on IO for the memory-mapped file will be blocked automatically by the OS until their memory is ready, and the other threads will continue to process.
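
In Java, this kind of memory mapping is available via FileChannel.map; the OS faults pages in as the buffer is touched. A small sketch (a single mapped region is shown; in the multi-threaded version each thread would map or slice its own region of the file):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedDemo {
    /** Counts newline bytes in a mapped region; pages are loaded on demand by the OS. */
    static long countLines(MappedByteBuffer buf) {
        long count = 0;
        while (buf.hasRemaining()) {
            if (buf.get() == '\n') count++; // touching a byte may trigger a page fault
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("records", ".txt");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) sb.append("record ").append(i).append('\n');
        Files.write(tmp, sb.toString().getBytes(StandardCharsets.US_ASCII));

        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            // Map the whole file read-only; nothing is read from disk until accessed.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            System.out.println(countLines(buf));   // prints 1000
        }
        Files.delete(tmp);
    }
}
```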


 