
Java: multithreaded character stream decoding

I am maintaining a high performance CSV parser and am trying to get the most out of the latest technology to improve throughput. For this particular task this means:

  • Flash memory (we own a relatively inexpensive PCI-Express card with 1 TB of storage that reaches 1 GB/s sustained read performance)
  • Multiple cores (we own a cheap Nehalem server with 16 hardware threads)

The first implementation of the CSV parser was single threaded: file reading, character decoding, field splitting, text parsing, all within the same thread. The result was a throughput of about 50 MB/s. Not bad, but well below the storage limit...

The second implementation uses one thread to read the file (at the byte level), one thread to decode the characters (from ByteBuffer to CharBuffer), and multiple threads to parse the fields (I mean parsing delimited text fields into doubles, integers, dates...). This works much faster, close to 400 MB/s on our box.
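
Roughly, the hand-off described above might look like the sketch below; the class name, queue capacities and chunk size are illustrative, not the actual parser code. A real decoder stage would also have to carry incomplete trailing byte sequences from one chunk into the next.

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class PipelinedCsvReader {
        // Bounded queues keep the reader from racing too far ahead of the decoder.
        private static final BlockingQueue<ByteBuffer> rawChunks = new ArrayBlockingQueue<>(8);
        private static final BlockingQueue<CharBuffer> decodedChunks = new ArrayBlockingQueue<>(8);
        private static final ByteBuffer EOF = ByteBuffer.allocate(0); // poison pill

        public static void main(String[] args) throws Exception {
            Path file = Path.of(args[0]);

            Thread reader = new Thread(() -> {
                try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                    while (true) {
                        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20); // 1 MB chunks
                        if (ch.read(buf) < 0) break;
                        buf.flip();
                        rawChunks.put(buf);
                    }
                    rawChunks.put(EOF);
                } catch (Exception e) { throw new RuntimeException(e); }
            });

            Thread decoder = new Thread(() -> {
                CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
                try {
                    ByteBuffer buf;
                    while ((buf = rawChunks.take()) != EOF) {
                        CharBuffer out = CharBuffer.allocate(buf.remaining());
                        dec.decode(buf, out, false); // trailing partial chars left in buf (ignored here)
                        out.flip();
                        decodedChunks.put(out);
                    }
                } catch (Exception e) { throw new RuntimeException(e); }
            });

            reader.start();
            decoder.start();
            // Field-parsing worker threads would take() from decodedChunks here.
            reader.join();
            decoder.join();
        }
    }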

But this is still well below the performance of our storage. And those SSDs will keep improving; we are not getting the most out of them in Java. It is clear that the current limitation is the character decoding (CharsetDecoder.read(...)). That is the bottleneck: on a powerful Nehalem processor it transforms bytes into chars at 400 MB/s, pretty good, but this has to be single threaded. The CharsetDecoder is somewhat stateful, depending on the charset used, and does not support multithreaded decoding.

So my question to the community is (and thank you for reading the post this far): does anyone know how to parallelize the charset decoding operation in Java?

does anyone know how to parallelize the charset decoding operation in Java?

You might be able to open multiple input streams to do this (I'm not sure how you'd go about this with NIO, but it must be possible).

How difficult this would be depends on the encoding you're decoding from. You will need a bespoke solution for the target encoding. If the encoding has a fixed width (e.g. Windows-1252), then one byte == one character and decoding is easy.
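
For a single-byte encoding like Windows-1252, every byte maps to exactly one char, so a chunk boundary can never split a character and chunks can be decoded completely independently. A minimal sketch of that idea (hypothetical helper, not from the answer):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class FixedWidthParallelDecode {

        // Decode Windows-1252 bytes with N workers. One byte is always one char,
        // so chunk boundaries never split a character.
        public static char[] decode(byte[] data, int threads) throws InterruptedException {
            Charset cs = Charset.forName("windows-1252");
            char[] out = new char[data.length];
            int chunk = (data.length + threads - 1) / threads;
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            CountDownLatch done = new CountDownLatch(threads);
            for (int t = 0; t < threads; t++) {
                final int start = Math.min(data.length, t * chunk);
                final int end = Math.min(data.length, start + chunk);
                pool.execute(() -> {
                    try {
                        // Each worker needs its own decoder: CharsetDecoder is not thread-safe.
                        CharBuffer cb = cs.newDecoder()
                                .decode(ByteBuffer.wrap(data, start, end - start));
                        cb.get(out, start, cb.remaining()); // byte index == char index here
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    } finally {
                        done.countDown();
                    }
                });
            }
            done.await();
            pool.shutdown();
            return out;
        }
    }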

Modern variable-width encodings (like UTF-8 and UTF-16) contain rules for identifying the first byte of a character sequence, so it is possible to jump into the middle of a file and start decoding (you'll have to note the end of the previous chunk, so it is wise to start decoding the end of the file first).
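
In UTF-8, for instance, continuation bytes always match the bit pattern 10xxxxxx, so a worker dropped at an arbitrary byte offset can skip forward to the next byte that starts a character. A small illustrative helper (names are made up):

    import java.nio.charset.StandardCharsets;

    public final class Utf8Boundaries {

        // Returns the index of the first byte at or after 'offset' that begins a
        // UTF-8 character. Continuation bytes have the form 10xxxxxx (0x80..0xBF).
        public static int nextCharStart(byte[] data, int offset) {
            int i = offset;
            while (i < data.length && (data[i] & 0xC0) == 0x80) {
                i++; // skip continuation bytes
            }
            return i;
        }

        public static void main(String[] args) {
            byte[] utf8 = "héllo wörld".getBytes(StandardCharsets.UTF_8);
            // Offset 2 lands on the continuation byte of 'é'; the next character starts at 3.
            System.out.println(nextCharStart(utf8, 2)); // prints 3
        }
    }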

Some legacy variable-width encodings might not be this well designed, so you'll have no option but to decode from the start of the data and read it sequentially.

If it is an option, generate your data as UTF-16BE. Then you can cut out decoding and read two bytes straight into a char.
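
A sketch of that shortcut, viewing a big-endian byte buffer directly as chars with no CharsetDecoder involved (this ignores the BOM and any surrogate pairs the downstream parser might care about):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.CharBuffer;

    public class Utf16beView {
        public static void main(String[] args) {
            // "AB" encoded as UTF-16BE: 0x00 0x41 0x00 0x42
            byte[] raw = {0x00, 0x41, 0x00, 0x42};
            CharBuffer chars = ByteBuffer.wrap(raw)
                    .order(ByteOrder.BIG_ENDIAN) // UTF-16BE: high byte first
                    .asCharBuffer();             // two bytes become one char, no decoding step
            System.out.println(chars);           // prints "AB"
        }
    }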

If the file is Unicode, watch out for BOM handling, but I'm guessing you're already familiar with many of the low-level details.

It is clear that the current limitation is the character decoding (CharsetDecoder.read(...))

How do you know that? Does your monitoring / profiling show conclusively that the decoder thread is using 100% of one of your cores?
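
One quick way to check is to compare the decoder thread's CPU time against wall-clock time with ThreadMXBean; a ratio near 1.0 means that thread really is saturating a core. A rough probe (the busy loop below just stands in for the real decoder thread):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    public class DecoderCpuProbe {
        public static void main(String[] args) throws InterruptedException {
            Thread decoder = new Thread(() -> {
                long x = 0;
                while (!Thread.currentThread().isInterrupted()) { x ^= x + 1; } // stand-in for the decode loop
            });
            decoder.setDaemon(true);
            decoder.start();

            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            long cpuStart = mx.getThreadCpuTime(decoder.getId());
            long wallStart = System.nanoTime();
            Thread.sleep(1000);
            double ratio = (mx.getThreadCpuTime(decoder.getId()) - cpuStart)
                    / (double) (System.nanoTime() - wallStart);
            System.out.printf("decoder thread CPU utilisation: %.2f%n", ratio);
        }
    }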

Another possibility is that the OS is not capable of driving the SSD at its theoretical maximum speed.

If UTF-8 decoding is definitely the bottleneck then it should be possible to do the task in parallel. But you will certainly need to implement your own decoders to do this.

If you know the encoding, and it is either fixed size or does not contain overlapping byte sequences, you could scan for a special sequence. In CSV, a sequence for newlines might make sense. Even if you detect the encoding dynamically, you could run a pass over the first few bytes to determine the encoding, and then move on to parallel decoding.
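
In UTF-8, for example, the newline byte 0x0A can never occur inside a multi-byte sequence, so it is a safe place to cut a CSV file into independently decodable chunks. A hypothetical chunking helper along those lines:

    import java.util.ArrayList;
    import java.util.List;

    public final class NewlineChunker {

        // Split 'data' into roughly 'targetSize'-byte chunks, always cutting just
        // after a '\n'. Each [start, end) range then holds only complete lines and
        // can be handed to its own decoder thread.
        public static List<int[]> chunks(byte[] data, int targetSize) {
            List<int[]> result = new ArrayList<>();
            int start = 0;
            while (start < data.length) {
                int end = Math.min(data.length, start + targetSize);
                while (end < data.length && data[end - 1] != '\n') {
                    end++; // extend to the next newline (or end of file)
                }
                result.add(new int[]{start, end});
                start = end;
            }
            return result;
        }
    }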

Another (crazy) alternative would be to just separate the input into chunks of some arbitrary size, ignore the decoding issues and then decode each of the chunks in parallel. However, you want to ensure that the chunks overlap (with a parametrized size). If the overlapping region of the two chunks is decoded the same way by the two threads (and your overlap was big enough for the specified encoding) it should be safe to join the results. The bigger the overlap, the more processing required, and the smaller the probability of error. Furthermore, if you are in a situation where you know the encoding is UTF-8, or a similarly simple encoding, you could set the overlap quite low (for that client) and still be guaranteed correct operation.

If the second chunk turns out to be wrong, you will have to redo it, so it is important not to process chunks that are too big in parallel. If you do more than two chunks in parallel, it would be important to 'repair' from beginning to end, so that one misaligned block does not invalidate the next block (which might be correctly aligned).
