Fastest way to index a large text file

Question

I want to index a large text file (about 1 GB), so I store the newline positions in another file and later access the text file through RandomAccessFile. Here is my code:

    while (true) {
        raf.seek(currentPos);
        byte[] bytes = new byte[1000000];
        raf.read(bytes, 0, bytes.length);
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == 10) {
                rafw.writeInt(currentPos + i);
            }
        }
        currentPos = currentPos + sizeOfPacket;
        if (currentPos > raf.length()) {
            sizeOfPacket = (int) raf.length() - currentPos;
        } else if (currentPos == raf.length()) {
            break;
        }
        bytesCounter = bytesCounter + 1000000;
        //Log.d("DicData", "Percentage=" + currentPos + " " + raf.length());
        int progress = (int) (bytesCounter * 100.0 / folderSize + 0.5);
        iDicIndexingListener.onTotalIndexingProgress(progress < 100 ? progress : 100);
    }

Here I check every byte of the file for the value 10, which is "\n" (newline). My big problem is that this process takes too much time, about 15 minutes. Is there a faster way to do this? Thanks

Answer 1

Writing and reading a 1 GB file with 1 million lines takes less than 10 seconds each on my machine. I suspect your performance bottleneck is somewhere else.

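import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.List;

public class Test {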
  public static void main(String[] args) throws Exception {
    File file = new File("test.txt");

    System.out.println("writing 1 GB file with 1 mio. lines...");
    try(FileOutputStream fos = new FileOutputStream(file)) {
      for(int i = 0; i < 1024 * 1024; i++) {
        fos.write(new byte[1023]);
        fos.write(10);
        if(i % 1024 == 0) {
          System.out.println(i / 1024 + " MB...");
        }
      }
    }
    System.out.println("done.");

    System.out.println("reading line positions...");
    List<Long> lineStartPositions = new ArrayList<>();
    lineStartPositions.add(0L);
    long positionInFile = -1;
    byte[] buffer = new byte[1024 * 1024];
    try(FileInputStream fis = new FileInputStream(file)) {
      long read = 0;
      while((read = fis.read(buffer)) != -1) {
        System.out.println("processing MB: " + positionInFile / 1024 / 1024);
        for(int i = 0; i < read; i++) {
          positionInFile++;
          if(buffer[i] == 10) {
            lineStartPositions.add(positionInFile + 1);
          }
        }
      }

      // remove the last line index in case the last byte of the file was a newline
      if(lineStartPositions.get(lineStartPositions.size() - 1) >= file.length()) {
        lineStartPositions.remove(lineStartPositions.size() - 1);
      }
    }

    System.out.println("found: " + lineStartPositions.size());
    System.out.println("expected: " + 1024 * 1024);
  }
}
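
To tie this back to the question's goal (an index file that is later used with RandomAccessFile), here is a minimal sketch. It assumes a UTF-8 text file and an index format of one 8-byte long per line start; the class and method names (LineIndexSketch, buildIndex, readLine) are made up for illustration. One possible cause of the slowness in the question is that RandomAccessFile.writeInt is unbuffered, so every newline triggers its own small write. The sketch writes the index through a BufferedOutputStream instead, and stores offsets as long so files over 2 GB do not overflow.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LineIndexSketch {

    // Scans textFile once and writes the start offset of every line to
    // indexFile as an 8-byte long. Returns the number of lines indexed.
    static long buildIndex(Path textFile, Path indexFile) throws IOException {
        long lines = 0;
        byte[] buffer = new byte[1 << 20]; // 1 MiB read buffer
        try (InputStream in = Files.newInputStream(textFile);
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(Files.newOutputStream(indexFile)))) {
            long pos = 0;               // offset of the byte being examined
            boolean atLineStart = true; // true until the first byte of a line is seen
            int read;
            while ((read = in.read(buffer)) != -1) {
                for (int i = 0; i < read; i++) { // only the bytes actually read
                    if (atLineStart) {
                        out.writeLong(pos);
                        lines++;
                        atLineStart = false;
                    }
                    if (buffer[i] == '\n') {
                        atLineStart = true;
                    }
                    pos++;
                }
            }
        }
        return lines;
    }

    // Reads one line (0-based) by looking up its offset in the index file.
    static String readLine(Path textFile, Path indexFile, long lineNumber) throws IOException {
        try (RandomAccessFile index = new RandomAccessFile(indexFile.toFile(), "r");
             RandomAccessFile text = new RandomAccessFile(textFile.toFile(), "r")) {
            index.seek(lineNumber * Long.BYTES);
            long start = index.readLong();
            // The line ends where the next one starts, or at end of file for the last line.
            long end = (lineNumber + 1) * Long.BYTES < index.length()
                    ? index.readLong()
                    : text.length();
            byte[] bytes = new byte[(int) (end - start)];
            text.seek(start);
            text.readFully(bytes);
            int len = bytes.length;
            if (len > 0 && bytes[len - 1] == '\n') len--; // drop the line terminator
            if (len > 0 && bytes[len - 1] == '\r') len--; // and a preceding \r, if any
            return new String(bytes, 0, len, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path text = Files.createTempFile("lines", ".txt");
        Path index = Files.createTempFile("lines", ".idx");
        Files.write(text, "first\nsecond\nthird\n".getBytes(StandardCharsets.UTF_8));
        System.out.println("lines indexed: " + buildIndex(text, index));
        System.out.println("line 1: " + readLine(text, index, 1));
    }
}
```

Because a trailing newline never starts a new line in this version, there is no need to remove a last entry afterwards.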

Answer 2

You can use the Scanner class to pre-read the file and index the newline positions:

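        File file = null;
        //init file here
        long lineStart = 0; // offset of the current line's first character
        // 1 for \n, 2 for \r\n; assumes the file uses the OS line separator
        int lineSepLength = System.lineSeparator().length();
        try (Scanner sc = new Scanner(file)) {
            while (sc.hasNextLine()) {
                //persist lineStart
                // length() is a char count; it equals the byte count only for single-byte encodings
                lineStart += sc.nextLine().length() + lineSepLength;
            }
        }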
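
A caveat with this approach: String.length() counts UTF-16 chars, while RandomAccessFile.seek works with byte offsets, so the two can drift apart as soon as the file contains non-ASCII text. The small sketch below (the class and helper names are hypothetical, and UTF-8 is assumed as the file encoding) shows the difference for a single line.

```java
import java.nio.charset.StandardCharsets;

public class CharVsByteLength {

    // Number of bytes the line occupies on disk when encoded as UTF-8
    static int utf8Length(String line) {
        return line.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String accented = "h\u00e9llo"; // "héllo": the accented letter takes 2 bytes in UTF-8

        // prints "5 chars, 5 bytes"
        System.out.println(ascii.length() + " chars, " + utf8Length(ascii) + " bytes");
        // prints "5 chars, 6 bytes"
        System.out.println(accented.length() + " chars, " + utf8Length(accented) + " bytes");
    }
}
```

If the offsets are meant for seek, counting bytes while scanning (as in Answer 1) avoids this mismatch.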

2 answers

Answer 1
0 2021-01-06 01:59:48

Answer 2
-1 2021-01-06 00:58:10
