
Java: read groups of lines with same prefix from very large text file

I have a large (~100GB) text file structured like this:

A,foobar
A,barfoo
A,foobar
B,barfoo
B,barfoo
C,foobar

Each line is a comma-separated pair of values. The file is sorted by the first value in the pair. The lines are variable length. Define a group as all lines with a common first value, i.e. in the example quoted above, all lines starting with "A," form one group and all lines starting with "B," form another.

The entire file is too large to fit into memory, but all the lines from any individual group will always fit into memory.

I have a routine for processing a single such group of lines and writing to a text file. My problem is that I don't know how best to read the file a group at a time. All the groups are of arbitrary, unknown size. I have considered two ways:

1) Scan the file using a BufferedReader, accumulating the lines from a group in a String or array. Whenever a line is encountered that belongs to a new group, hold that line in a temporary variable and process the previous group. Clear the accumulator, add the temporary and then continue reading the new group starting from the second line.

2) Scan the file using a BufferedReader; whenever a line is encountered that belongs to a new group, somehow reset the cursor so that when readLine() is next invoked it starts from the first line of the group instead of the second. I have looked into mark() and reset() but these require knowing the byte-position of the start of the line.
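For what it's worth, option (2) may be workable without tracking byte positions: BufferedReader.mark(int) takes a read-ahead limit in characters, so marking before every readLine() lets reset() rewind exactly one line. The following is only a sketch under assumptions; the class name MarkResetGroups and the bound MAX_LINE_CHARS are hypothetical, and it only works if no line (including its terminator) is longer than that bound.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating option (2); not part of the original question.
final class MarkResetGroups {

    // Assumed upper bound on the length of one line in chars, terminator included.
    static final int MAX_LINE_CHARS = 1 << 16;

    /**
     * Reads the next group of lines sharing the same first value.
     * Returns an empty list at end of file. On return, the reader is
     * positioned at the first line of the following group.
     */
    static List<String> readGroup(BufferedReader reader) throws IOException {
        List<String> group = new ArrayList<>();
        String key = null;
        while (true) {
            reader.mark(MAX_LINE_CHARS);        // remember the start of this line
            String line = reader.readLine();
            if (line == null) {
                return group;                   // end of file
            }
            String lineKey = keyOf(line);
            if (key == null) {
                key = lineKey;                  // first line defines the group
            } else if (!lineKey.equals(key)) {
                reader.reset();                 // rewind to the start of this line
                return group;
            }
            group.add(line);
        }
    }

    // Everything before the first comma; the whole line if there is no comma.
    static String keyOf(String line) {
        int comma = line.indexOf(',');
        return comma < 0 ? line : line.substring(0, comma);
    }
}
```

Calling readGroup repeatedly until it returns an empty list walks the file one group at a time. The price is a mark() call per line and a hard limit on line length.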

I'm going to go with (1) at the moment, but I would be very grateful if someone could suggest a method that smells less.
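As a rough illustration of option (1), here is a minimal sketch. The class GroupReader, the method forEachGroup and the Consumer-based callback are hypothetical names chosen for the example; the callback stands in for the existing group-processing routine. Starting a fresh list for each group means the line that opens a new group can go straight into that list, so no separate temporary variable is needed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper illustrating option (1); not part of the original question.
final class GroupReader {

    /** Reads sorted "key,value" lines and passes each complete group to the handler. */
    static void forEachGroup(Reader source, Consumer<List<String>> handler) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        List<String> group = new ArrayList<>();
        String currentKey = null;
        String line;
        while ((line = reader.readLine()) != null) {
            String key = keyOf(line);
            if (!key.equals(currentKey)) {
                // A new group starts here: flush the previous one first.
                if (!group.isEmpty()) {
                    handler.accept(group);
                }
                group = new ArrayList<>();
                currentKey = key;
            }
            group.add(line);
        }
        // The last group has no following line to trigger the flush.
        if (!group.isEmpty()) {
            handler.accept(group);
        }
    }

    // Everything before the first comma; the whole line if there is no comma.
    static String keyOf(String line) {
        int comma = line.indexOf(',');
        return comma < 0 ? line : line.substring(0, comma);
    }
}
```

With a real file this could be called as GroupReader.forEachGroup(new FileReader(path), group -> process(group)), where process is whatever routine already handles a single group. Only one group is held in memory at a time.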

I think a PushbackReader would work:

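 // Assumes reader was created as new PushbackReader(in, size) with size larger
 // than the longest line; the default pushback buffer holds a single char.
 if (lineBelongsToNewGroup) {
     // unread() puts chars at the front, so push the newline back first
     reader.unread('\n');
     reader.unread(lastLine.toCharArray());
 }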
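A fuller sketch of the same idea follows, with some assumptions spelled out. PushbackReader has no readLine(), so the example reads characters up to '\n' itself. The class PushbackGroups, its methods and the bound MAX_LINE_CHARS are hypothetical, lines are assumed to end in '\n', and unread() throws an IOException if a line is longer than the pushback buffer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the PushbackReader idea.
final class PushbackGroups {

    // Assumed upper bound on the length of one line in chars.
    static final int MAX_LINE_CHARS = 1 << 16;

    // The pushback buffer must hold a whole line plus its newline.
    static PushbackReader open(Reader source) {
        return new PushbackReader(new BufferedReader(source), MAX_LINE_CHARS + 1);
    }

    /** Reads the next group; returns an empty list at end of file. */
    static List<String> readGroup(PushbackReader reader) throws IOException {
        List<String> group = new ArrayList<>();
        String key = null;
        String line;
        while ((line = readLine(reader)) != null) {
            String lineKey = keyOf(line);
            if (key == null) {
                key = lineKey;
            } else if (!lineKey.equals(key)) {
                // Push back in reverse order: newline first, then the line.
                reader.unread('\n');
                reader.unread(line.toCharArray());
                return group;
            }
            group.add(line);
        }
        return group;
    }

    // Reads chars up to the next '\n'; returns null once the input is exhausted.
    static String readLine(PushbackReader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (c == '\n') {
                return sb.toString();
            }
            sb.append((char) c);
        }
        return sb.length() == 0 ? null : sb.toString();
    }

    // Everything before the first comma; the whole line if there is no comma.
    static String keyOf(String line) {
        int comma = line.indexOf(',');
        return comma < 0 ? line : line.substring(0, comma);
    }
}
```

Compared with simply carrying the extra line over in a variable, this pushes every group-opening line through the reader twice, so it is probably only attractive when the group-reading code has to stay unaware of the lookahead.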

I think option 1 is the simplest. I would parse the text myself rather than use BufferedReader, as it will take a long time to parse 100 GB.

The only option likely to be faster is a binary search over the file using RandomAccessFile. You can memory map 100 GB on a 64-bit JVM (in several regions, since a single MappedByteBuffer covers at most 2 GB). This avoids the need to parse every line, which is pretty expensive. An advantage of this approach is that you can use multiple threads. It is far, far more complicated to implement, but should be much faster. Once you have each boundary, you can copy the raw data in bulk without having to parse all the lines.
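To make the binary-search idea concrete, here is a small sketch that finds the byte offset where a given group starts. It rests on several assumptions not stated above: keys are ASCII, the file's sort order matches String.compareTo, and the key being searched for is known in advance. The class GroupSeeker and its methods are hypothetical. RandomAccessFile.readLine() is unbuffered and slow per call, but a search only needs a few dozen probes even on a 100 GB file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper illustrating a binary search for a group boundary.
final class GroupSeeker {

    /** Byte offset of the first line whose key is >= target, or file.length() if none. */
    static long groupStart(RandomAccessFile file, String target) throws IOException {
        long length = file.length();
        if (length == 0) {
            return 0;
        }
        // The first line has no preceding newline, so check it separately.
        file.seek(0);
        if (keyOf(file.readLine()).compareTo(target) >= 0) {
            return 0;
        }
        // Find the smallest offset whose *following* line reaches the target.
        long lo = 0;
        long hi = length - 1;
        while (lo < hi) {
            long mid = (lo + hi) >>> 1;
            if (nextLineReachesTarget(file, mid, target)) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return nextLineStart(file, lo);
    }

    // Start of the line that follows the line containing byte pos.
    static long nextLineStart(RandomAccessFile file, long pos) throws IOException {
        file.seek(pos);
        file.readLine();                 // discard the rest of the current line
        return file.getFilePointer();
    }

    // True if the line after pos has key >= target, or if there is no such line.
    static boolean nextLineReachesTarget(RandomAccessFile file, long pos, String target)
            throws IOException {
        long start = nextLineStart(file, pos);
        if (start >= file.length()) {
            return true;
        }
        return keyOf(file.readLine()).compareTo(target) >= 0;
    }

    // Everything before the first comma; the whole line if there is no comma.
    static String keyOf(String line) {
        int comma = line.indexOf(',');
        return comma < 0 ? line : line.substring(0, comma);
    }
}
```

With the start offsets of two consecutive keys in hand, the bytes in between are exactly one group and can be copied in bulk, and independent offset ranges could be handed to separate threads. Discovering the keys themselves without a full scan is the harder part and is not covered by this sketch.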
