How to find a name in an unordered list of names in an 8 GB flat file in Java
Ok, so we have this problem, and I know I can use an InputStream to read the file as a stream instead of reading the whole file, as that would cause memory issues.

Referring to this answer: https://stackoverflow.com/a/14037510/1316967

However, the concern is speed, as I would, in this case, be reading each line of the entire file. Considering this file contains millions of names in an unordered fashion and this operation has to be achieved in a few seconds, how do I go about solving this problem?
Because the list is unordered, there is no alternative to reading the entire file. If you're lucky, the first name is the one you're looking for: O(1). If you're unlucky, it's the last one: O(n).
Apart from this, it doesn't matter whether you do it the java.io way (Files.newBufferedReader()) or the java.nio way (Files.newByteChannel()); both perform more or less the same. If the input file is line based (as in your case), you may use

Files.lines(path).filter(l -> name.equals(l)).findFirst();

which internally uses a BufferedReader.
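A minimal, self-contained sketch of that linear scan (the file contents and the target name here are illustrative demo values; in the real case the stream comes from the 8 GB file):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

public class LinearScan {
    public static void main(String[] args) throws IOException {
        // tiny demo file standing in for the 8 GB one
        Path file = Files.createTempFile("names", ".txt");
        Files.write(file, List.of("carol", "alice", "bob"));

        String name = "alice"; // the name we're searching for
        // Files.lines streams the file line by line, so memory use stays
        // constant regardless of file size
        try (Stream<String> lines = Files.lines(file)) {
            Optional<String> hit = lines.filter(name::equals).findFirst();
            System.out.println(hit.isPresent() ? "found" : "not found");
        }
        Files.delete(file);
    }
}
```

The try-with-resources block matters: Files.lines keeps the underlying file handle open until the stream is closed.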
If you really want to speed things up, you have to sort the names in the file first (see How do I sort very large files).
Once you have an ordered list, you can fast-scan it and create an index using a TreeMap, then jump right to the correct file position (using a RandomAccessFile or SeekableByteChannel) and read the name.

For example:
long blockSize = 1048576L;
Path file = Paths.get("yourFile");
long fileSize = Files.size(file);
RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r");

//create the index: one entry roughly every blockSize bytes,
//always keyed by a complete line
TreeMap<String, Long> index = new TreeMap<>();
for (long pos = 0; pos < fileSize; pos += blockSize) {
    //jump to the next block
    raf.seek(pos);
    if (pos > 0) {
        raf.readLine(); //skip the (probably partial) line we landed in
    }
    long lineStart = raf.getFilePointer();
    String line = raf.readLine();
    if (line != null) {
        index.put(line, lineStart);
    }
}

//get the position of a name
String name = "someName";
//get the beginning and end of the block that may contain the name;
//floorEntry/higherEntry keep an exact index hit inside the scanned range
long offset = Optional.ofNullable(index.floorEntry(name))
        .map(Map.Entry::getValue).orElse(0L);
long limit = Optional.ofNullable(index.higherEntry(name))
        .map(Map.Entry::getValue).orElse(fileSize);
//move the pointer to the offset position and scan line by line
raf.seek(offset);
long cur;
while ((cur = raf.getFilePointer()) < limit) {
    if (name.equals(raf.readLine())) {
        return cur; //file position of the name
    }
}
The block size is a tradeoff between index size, index-creation time and data-access time. The larger the blocks, the smaller the index and the index-creation time, but the larger the data-access time. For an 8 GB file with 1 MiB blocks, for instance, the index holds only about 8,192 entries, and a lookup scans at most one 1 MiB block.
I would suggest moving the data to a database (check out SQLite for a serverless option). If that is not possible, you can try to have multiple threads reading the file, each starting at a different offset in the file and reading only a portion of it. You would have to use a RandomAccessFile. This will only be beneficial if you are on a RAID system, as benchmarked here: http://www.drdobbs.com/parallel/multithreaded-file-io/220300055?pgno=2
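A minimal sketch of that multi-threaded variant, under the assumption that each worker opens its own RandomAccessFile and owns exactly the lines that start inside its byte range (the demo writes a tiny temp file; in the real case the ranges would partition the 8 GB file):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScan {
    // scan the chunk [start, end) for an exact line match;
    // a worker handles every line whose first byte lies in its range
    static boolean scanChunk(Path file, long start, long end, String name)
            throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            if (start > 0) {
                // back up one byte and discard up to the next newline, so a
                // line straddling the boundary is read only by the worker
                // whose range contains its first byte
                raf.seek(start - 1);
                raf.readLine();
            }
            String line;
            while (raf.getFilePointer() < end && (line = raf.readLine()) != null) {
                if (name.equals(line)) {
                    return true;
                }
            }
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // tiny demo file standing in for the 8 GB one
        Path file = Files.createTempFile("names", ".txt");
        Files.write(file, List.of("carol", "alice", "bob", "dave"));

        String name = "bob";
        int threads = 2;
        long size = Files.size(file);
        long chunk = Math.max(1, size / threads);

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Boolean>> results = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            long start = i * chunk;
            long end = (i == threads - 1) ? size : (i + 1) * chunk;
            results.add(pool.submit(() -> scanChunk(file, start, end, name)));
        }
        boolean found = false;
        for (Future<Boolean> f : results) {
            found |= f.get();
        }
        pool.shutdown();
        Files.delete(file);
        System.out.println(found ? "found" : "not found");
    }
}
```

On a single spinning disk the seeks from concurrent readers can make this slower than one sequential scan, which is why the linked benchmark only shows gains on RAID (or, generally, on SSDs).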