How to find a name in an unordered list of names in an 8 GB flat file in Java
Ok, so we have this problem, and I know I can use an InputStream to read the file as a stream instead of reading the whole file, as that would cause memory issues.

Referring to this answer: https://stackoverflow.com/a/14037510/1316967

However, the concern is speed, as I would, in this case, be reading each line of the entire file. Considering this file contains millions of names in an unordered fashion and this operation has to be achieved in a few seconds, how do I go about solving this problem?
Because the list is unordered, there is no alternative to reading the entire file. If you're lucky, the first name is the one you're looking for: O(1). If you're unlucky, it's the last one: O(n).
Apart from this, it doesn't matter whether you do it the java.io way (Files.newBufferedReader()) or the java.nio way (Files.newByteChannel()); both perform more or less the same. If the input file is line based (as in your case), you may use

Files.lines(path).filter(l -> name.equals(l)).findFirst();

which internally uses a BufferedReader.
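A minimal, self-contained sketch of that linear scan (the file contents and the target name here are illustrative demo values; in the real case the stream comes from the 8 GB file):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

public class LinearScan {
    public static void main(String[] args) throws IOException {
        // tiny demo file standing in for the 8 GB one
        Path file = Files.createTempFile("names", ".txt");
        Files.write(file, List.of("carol", "alice", "bob"));

        String name = "alice"; // the name we're searching for
        // Files.lines streams the file line by line, so memory use stays
        // constant regardless of file size
        try (Stream<String> lines = Files.lines(file)) {
            Optional<String> hit = lines.filter(name::equals).findFirst();
            System.out.println(hit.isPresent() ? "found" : "not found");
        }
        Files.delete(file);
    }
}
```

The try-with-resources block matters: Files.lines keeps the underlying file handle open until the stream is closed.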
If you really want to speed things up, you have to sort the names in the file first (see How do I sort very large files).
Once you have an ordered list, you can fast-scan it and create an index using a TreeMap, then jump right to the correct file position (using a RandomAccessFile or SeekableByteChannel) and read the name.

For example:
long blockSize = 1048576L;
Path file = Paths.get("yourFile");
long fileSize = Files.size(file);
RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r");

//create the index: one entry roughly every blockSize bytes,
//always keyed by a complete line
TreeMap<String, Long> index = new TreeMap<>();
for (long pos = 0; pos < fileSize; pos += blockSize) {
    //jump to the next block
    raf.seek(pos);
    if (pos > 0) {
        raf.readLine(); //skip the (probably partial) line we landed in
    }
    long lineStart = raf.getFilePointer();
    String line = raf.readLine();
    if (line != null) {
        index.put(line, lineStart);
    }
}

//get the position of a name
String name = "someName";
//get the beginning and end of the block that may contain the name;
//floorEntry/higherEntry keep an exact index hit inside the scanned range
long offset = Optional.ofNullable(index.floorEntry(name))
        .map(Map.Entry::getValue).orElse(0L);
long limit = Optional.ofNullable(index.higherEntry(name))
        .map(Map.Entry::getValue).orElse(fileSize);
//move the pointer to the offset position and scan line by line
raf.seek(offset);
long cur;
while ((cur = raf.getFilePointer()) < limit) {
    if (name.equals(raf.readLine())) {
        return cur; //file position of the name
    }
}
The block size is a tradeoff between index size, index-creation time and data-access time. The larger the blocks, the smaller the index and the index-creation time, but the larger the data-access time. For an 8 GB file with 1 MiB blocks, for instance, the index holds only about 8,192 entries, and a lookup scans at most one 1 MiB block.
I would suggest moving the data to a database (check out SQLite for a serverless option). If that is not possible, you can try to have multiple threads reading the file, each starting at a different offset in the file and reading only a portion of it. You would have to use a RandomAccessFile. This will only be beneficial if you are on a RAID system, as benchmarked here: http://www.drdobbs.com/parallel/multithreaded-file-io/220300055?pgno=2
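A minimal sketch of that multi-threaded variant, under the assumption that each worker opens its own RandomAccessFile and owns exactly the lines that start inside its byte range (the demo writes a tiny temp file; in the real case the ranges would partition the 8 GB file):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScan {
    // scan the chunk [start, end) for an exact line match;
    // a worker handles every line whose first byte lies in its range
    static boolean scanChunk(Path file, long start, long end, String name)
            throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            if (start > 0) {
                // back up one byte and discard up to the next newline, so a
                // line straddling the boundary is read only by the worker
                // whose range contains its first byte
                raf.seek(start - 1);
                raf.readLine();
            }
            String line;
            while (raf.getFilePointer() < end && (line = raf.readLine()) != null) {
                if (name.equals(line)) {
                    return true;
                }
            }
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // tiny demo file standing in for the 8 GB one
        Path file = Files.createTempFile("names", ".txt");
        Files.write(file, List.of("carol", "alice", "bob", "dave"));

        String name = "bob";
        int threads = 2;
        long size = Files.size(file);
        long chunk = Math.max(1, size / threads);

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Boolean>> results = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            long start = i * chunk;
            long end = (i == threads - 1) ? size : (i + 1) * chunk;
            results.add(pool.submit(() -> scanChunk(file, start, end, name)));
        }
        boolean found = false;
        for (Future<Boolean> f : results) {
            found |= f.get();
        }
        pool.shutdown();
        Files.delete(file);
        System.out.println(found ? "found" : "not found");
    }
}
```

On a single spinning disk the seeks from concurrent readers can make this slower than one sequential scan, which is why the linked benchmark only shows gains on RAID (or, generally, on SSDs).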