How to extract strings from specified indices in a text file using Java?

I'm using Java to try to extract the characters between specific indices in a text file. The file is large and I'm not allowed to load it into memory, so I can only read parts of it, namely the parts at those indices. How can I do this?

I might also be able to call the Linux terminal from within Java and use something like sed or awk, but then I would have to learn those tools as well.

Either way, it has to be fast: the whole program may not take more than one second to run.

Grateful for any suggestions!

If an index in the text file corresponds to the byte at that offset (as it does for single-byte encodings such as ASCII), you can use RandomAccessFile to seek to a specific byte and read directly from there.

According to the documentation for RandomAccessFile#seek:

Sets the file-pointer offset, measured from the beginning of this file, at which the next read or write occurs.

You can do the following:

try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
    raf.seek(index); // the next read starts at this byte offset
    // read from raf here
}

Here file is your text file, "r" is the mode (read-only), and index is the byte offset at which you want to start reading.
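To go from seeking to actually extracting the text between two indices, the snippet could be extended roughly as follows. This is a minimal sketch, assuming a single-byte encoding (such as ASCII or ISO-8859-1) so that character indices equal byte offsets. The class name `RangeReader` and the method `readRange` are made up for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class RangeReader {

    // Hypothetical helper: returns the text stored in bytes [start, end) of the file.
    // Assumes a single-byte encoding, so one byte is one character.
    static String readRange(String path, long start, long end) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            byte[] buffer = new byte[(int) (end - start)];
            raf.seek(start);        // jump straight to the first byte of interest
            raf.readFully(buffer);  // throws EOFException if the range runs past the end
            return new String(buffer, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.out.println("Usage: RangeReader <file> <start> <end>");
            return;
        }
        System.out.println(
                readRange(args[0], Long.parseLong(args[1]), Long.parseLong(args[2])));
    }
}
```

Only the requested bytes are read, so memory use is proportional to the size of the range rather than the size of the file.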

Depending on how your text file is formatted, you can read byte by byte up to the next newline character \n. You may also have to account for newlines when calling seek: if your index does not count line terminators, add the number of preceding lines to it (one byte per \n).
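Reading up to the next newline and correcting an index for line terminators can be sketched as follows, under loudly stated assumptions: a single-byte encoding and, for the offset conversion only, lines that all have the same length. `LineAtOffset`, `readToLineEnd`, and `toByteOffset` are hypothetical names, not part of any library.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class LineAtOffset {

    // Reads from byteOffset up to (not including) the next line terminator or end of file.
    // RandomAccessFile.readLine maps each byte to one char, so this assumes
    // a single-byte encoding.
    static String readToLineEnd(String path, long byteOffset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(byteOffset);
            String rest = raf.readLine(); // null if byteOffset is at or past end of file
            return rest == null ? "" : rest;
        }
    }

    // Converts an index that does NOT count line terminators into a byte offset.
    // Assumption: every line holds exactly lineLength characters followed by
    // terminatorLength bytes (1 for "\n", 2 for "\r\n").
    static long toByteOffset(long charIndex, int lineLength, int terminatorLength) {
        long precedingLines = charIndex / lineLength;
        return charIndex + precedingLines * terminatorLength;
    }
}
```

If the lines vary in length, there is no such shortcut: the byte offset then has to come from an index that already counts the terminators.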

You can stream the file and skip to whichever line you want. Once you have that line, you can extract a substring from it as you normally would.

Take a look at this example:

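import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

long start = System.currentTimeMillis();

// Files.lines reads lazily, so the file is never loaded into memory as a whole.
try (Stream<String> lines = Files.lines(Paths.get("myfile.txt"))) {
    // Skip the first 500,000 lines and take the next one.
    String line = lines.skip(500000).findFirst()
            .orElseThrow(() -> new IOException("File has too few lines"));
    String extracted = line.substring(10, 20); // characters 10 to 19 of that line
    System.out.println(extracted);
} catch (IOException e) {
    e.printStackTrace();
}

System.out.println("Time taken: " + (System.currentTimeMillis() - start) / 1000.0);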

I've tested this with a 1 GB file containing 1,000,000 lines of text. It extracts a small substring from line 500,000.

Output:

[Image: test output]
