
How to find a name in an unordered list of names in an 8GB flat file in Java

OK, so we have this problem. I know I can use an InputStream to read the file as a stream instead of loading the whole file, which would cause memory issues.

Referring to this answer: https://stackoverflow.com/a/14037510/1316967

However, the concern is speed, since in this case I would be reading every line of the entire file. Considering that this file contains millions of names in no particular order and the operation has to complete in a few seconds, how do I go about solving this problem?

Because the list is unordered, there is no alternative to reading the entire file.

If you're lucky, the first name is the name you're looking for: O(1).

If you're unlucky, it's the last name: O(n).

Apart from this, it doesn't matter whether you do it the java.io way ( Files.newBufferedReader() ) or the java.nio way ( Files.newByteChannel() ); both perform more or less the same. If the input file is line based (as in your case), you may use

Files.lines(file).filter(l -> name.equals(l)).findFirst();

which internally uses a BufferedReader.
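As a slightly fuller sketch of that linear scan: the class name `LinearScan`, the method `contains`, and the UTF-8 charset are assumptions for illustration, not part of the original answer. The stream is opened in try-with-resources so the underlying reader is closed once a match is found or the file ends.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

class LinearScan {
    // Returns true as soon as a line equal to name is seen.
    // In the worst case every line of the file is read once.
    static boolean contains(Path file, String name) throws IOException {
        // Assumption: the file is UTF-8 encoded, one name per line.
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.anyMatch(name::equals);
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java LinearScan <file> <name>
        System.out.println(contains(Paths.get(args[0]), args[1]));
    }
}
```

Lines are read lazily, so memory use stays small regardless of file size; the running time is still bounded by how fast the disk can deliver the file.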

If you really want to speed things up, you have to sort the names in the file (see How do I sort very large files). Then you are able to read from an ordered list using an index.

EDIT: ordered list using an index

Once you have an ordered list, you could fast-scan it and create an index using a TreeMap, then jump right to the correct file position (use a RandomAccessFile or SeekableByteChannel ) and read the name.

For example:

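import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Returns the byte offset of the line equal to name, or -1 if it is absent.
// Assumes the file is sorted in String.compareTo order and uses a single-byte
// encoding, because RandomAccessFile.readLine() maps each byte to one char.
static long find(Path file, String name) throws IOException {
    long blockSize = 1048576L;
    long fileSize = Files.size(file);
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
        //create the index: first full line of each block -> start position of that line
        TreeMap<String, Long> index = new TreeMap<>();
        for (long pos = 0; pos < fileSize; pos += blockSize) {
            //a block boundary may fall inside a line, so move on to the next line start
            raf.seek(Math.max(0, pos - 1));
            if (pos > 0) {
                raf.readLine();
            }
            long lineStart = raf.getFilePointer();
            String line = raf.readLine();
            if (line != null) {
                //keep the first position if the same name starts several blocks
                index.putIfAbsent(line, lineStart);
            }
        }

        //get the beginning and end of the range that can contain the name
        long offset = Optional.ofNullable(index.floorEntry(name)).map(Map.Entry::getValue).orElse(0L);
        long limit = Optional.ofNullable(index.higherEntry(name)).map(Map.Entry::getValue).orElse(fileSize);

        //move the pointer to the offset position and scan line by line
        raf.seek(offset);
        long cur;
        while ((cur = raf.getFilePointer()) < limit) {
            if (name.equals(raf.readLine())) {
                return cur;
            }
        }
        return -1;
    }
}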

The block size is a tradeoff between index size, index-creation time and data-access time. The larger the blocks, the smaller the index and the shorter the index-creation time, but the longer the data-access time.
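To make the tradeoff concrete, here is a small back-of-the-envelope sketch. It assumes one index entry per block and takes 8 GB to mean 8 GiB; the class name `IndexSizing` and the three sample block sizes are illustrative choices, not values from the original answer.

```java
class IndexSizing {
    // One index entry per block, rounding up for a final partial block.
    static long entries(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long fileSize = 8L * 1024 * 1024 * 1024; // assumption: 8 GB read as 8 GiB
        long[] blockSizes = {64L * 1024, 1024L * 1024, 16L * 1024 * 1024};
        for (long blockSize : blockSizes) {
            // A lookup scans roughly one block, so blockSize approximates the bytes read per query.
            System.out.printf("block=%d bytes: %d index entries, about %d bytes scanned per lookup%n",
                    blockSize, entries(fileSize, blockSize), blockSize);
        }
    }
}
```

With the 1 MiB block size used above, an 8 GiB file needs 8192 index entries, which easily fits in memory, while each lookup scans about 1 MiB of the file.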

I would suggest moving the data to a database (check out SQLite for a serverless option).

If that is not possible, you can try to have multiple threads reading the file, each starting at a different offset in the file and reading only a portion of the file.

You would have to use a RandomAccessFile. This will only be beneficial if you are on a RAID system, as benchmarked here: http://www.drdobbs.com/parallel/multithreaded-file-io/220300055?pgno=2
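A minimal sketch of that multithreaded approach follows. The class name `ParallelScan`, the methods `scanRange` and `find`, and the thread count are hypothetical. It assumes one name per line in a single-byte encoding, since RandomAccessFile.readLine() maps each byte to one char; readLine() is also unbuffered, so a real implementation would likely add its own buffering. Each thread opens its own RandomAccessFile and handles only the lines that start inside its byte range.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelScan {
    // Scans the lines that start inside [start, end); returns the byte offset of a match, or -1.
    static long scanRange(Path file, String name, long start, long end) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            if (start > 0) {
                // Finish the line that straddles the boundary; the previous range owns it.
                raf.seek(start - 1);
                raf.readLine();
            }
            long cur;
            while ((cur = raf.getFilePointer()) < end) {
                String line = raf.readLine();
                if (line == null) break;           // end of file
                if (name.equals(line)) return cur; // offset where the matching line starts
            }
            return -1;
        }
    }

    // Splits the file into one byte range per thread; returns the first match in range order, or -1.
    static long find(Path file, String name, int threads)
            throws IOException, InterruptedException, ExecutionException {
        long size = Files.size(file);
        long chunk = Math.max(1, (size + threads - 1) / threads);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (long start = 0; start < size; start += chunk) {
                final long from = start;
                final long to = Math.min(size, start + chunk);
                Callable<Long> task = () -> scanRange(file, name, from, to);
                results.add(pool.submit(task));
            }
            for (Future<Long> result : results) {
                long pos = result.get();
                if (pos >= 0) return pos;
            }
            return -1;
        } finally {
            pool.shutdownNow(); // stop accepting work; running scans finish on their own
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java ParallelScan <file> <name>
        System.out.println(find(Paths.get(args[0]), args[1], 4));
    }
}
```

Whether this beats a single sequential scan depends on the storage: on a single spinning disk the threads compete for the same head, which is consistent with the benchmark's point about RAID.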
