Java: How to quickly extract matching lines from a large text file?

Question

Although I am aware that there are plenty of general solutions to my problem, I am still not satisfied with the runtime they require in my particular case.

Consider a 35 GB text file in FASTA format, like this:

>Protein_1 So nice and cute little fella
MTTKKCLQKFHLESLGKLGDSFLKYAISIQLFKSYENHYEGLPSIKKNKIISNAALFKLG 
YARKILRFIRNEPFDLKVGLIPSDNSQAYNFGKEFLMPSVKMCSRVK*
>Protein_2 Fancy incredible description of its function
MADDSKFCFFLVSTFLLLAVVVNVTLAANYVPGDDILLNCGGPDNLPDADGRKWGTDIGS
[…] etc.

I need to extract only the > lines.

Using grep '>' proteins.fasta > protein_descriptions.txt to achieve this takes only a couple of minutes.

But using Java 7, this has now been running for over 90 minutes:

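import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public class ExtractHeaders {
    public static void main(String[] args) throws Exception {
        BufferedReader fastaIn = new BufferedReader(new FileReader(args[0]));
        List<String> l = new ArrayList<String>();
        String str;
        while ((str = fastaIn.readLine()) != null) {
            if (str.startsWith(">")) {
                // List has no append(); add() stores the header line
                l.add(str);
            }
        }
        fastaIn.close();
        // …
    }
}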

Does anyone have an idea of how to speed this up to grep performance?

Your help will be much appreciated. Cheers!

Answer 1

If you write each matching line to the output file immediately instead of accumulating objects in memory, performance will improve (and it will be closer to what you did with grep anyway).

...
BufferedWriter fastaOut = new BufferedWriter(new FileWriter(args[1]));
...
while ((str = fastaIn.readLine()) != null) {
    if (str.startsWith(">")) {
        // write each header line straight to the output file
        fastaOut.write(str);
        fastaOut.newLine();
    }
}
...
fastaOut.close();
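
For reference, the fragment above could be filled out into a complete program roughly as follows. This is only a sketch: the class name FastaHeaderStream and the method extractHeaders are made up for illustration, and it assumes the file is plain ASCII, so ISO-8859-1 is used as a single-byte decoding that cannot fail on stray bytes.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical class name, not taken from the original answer.
class FastaHeaderStream {

    /** Copies every line starting with '>' from in to out and returns how many were written. */
    static long extractHeaders(Path in, Path out) throws IOException {
        long count = 0;
        // Assumption: ASCII content, decoded as ISO-8859-1 (one byte per char).
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.ISO_8859_1);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(">")) {
                    // Nothing is kept in memory: the header goes straight to the writer.
                    writer.write(line);
                    writer.newLine();
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        long n = extractHeaders(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(n + " header lines written");
    }
}
```

The try-with-resources block (available since Java 7) closes both files even if reading fails halfway through.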

Answer 2

biojava.org provides a FASTA reader. For reading huge files you would have to consider using a SeekableByteChannel together with ByteBuffers. The BioJava library uses byte buffers.
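
To illustrate the byte-buffer idea without BioJava, here is a rough sketch that reads the file through a FileChannel (which implements SeekableByteChannel) and scans raw bytes instead of building a String per line. It is not how BioJava does it; the class name FastaHeaderBytes, the method extractHeaders, and the buffer sizes are assumptions chosen for the example.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical class name; a sketch, not BioJava's implementation.
class FastaHeaderBytes {

    /** Copies the bytes of every line starting with '>' to out and returns the number of such lines. */
    static long extractHeaders(Path in, Path out) throws IOException {
        long count = 0;
        ByteBuffer buf = ByteBuffer.allocate(1 << 20); // 1 MiB read buffer (arbitrary choice)
        try (FileChannel channel = FileChannel.open(in, StandardOpenOption.READ);
             OutputStream os = new BufferedOutputStream(Files.newOutputStream(out), 1 << 16)) {
            boolean atLineStart = true; // the next byte is the first byte of a line
            boolean inHeader = false;   // currently copying a '>' line
            // Both flags live outside the loop so lines spanning two buffer fills are handled.
            while (channel.read(buf) != -1) {
                buf.flip();
                while (buf.hasRemaining()) {
                    byte b = buf.get();
                    if (atLineStart && b == '>') {
                        inHeader = true;
                        count++;
                    }
                    if (inHeader) {
                        os.write(b); // includes the terminating '\n' of the header line
                    }
                    atLineStart = (b == '\n');
                    if (atLineStart) {
                        inHeader = false;
                    }
                }
                buf.clear();
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(extractHeaders(Paths.get(args[0]), Paths.get(args[1])) + " header lines written");
    }
}
```

Because the bytes are copied as-is, no character decoding takes place and the output keeps the line endings of the input file.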

Answer 3

You could probably speed this up considerably using multiple threads. If the file is X bytes long and you have n threads, you start each thread at X/n intervals and have each one read X/n bytes. You will want to synchronize your ArrayList to ensure your results are added correctly.
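
A rough sketch of this chunking scheme is shown below. It departs from the description in one respect: instead of a shared synchronized ArrayList, each task returns its own list and the lists are joined in file order. The names FastaHeaderParallel, scanRange and extractHeaders are invented for the example, and LF line endings and ASCII content are assumed. Whether this is actually faster depends on whether the job is limited by CPU or by disk throughput, so it should be measured rather than taken for granted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical class name; a sketch under the assumptions stated above.
class FastaHeaderParallel {

    /** Collects the header lines whose first byte lies in [start, end). */
    static List<String> scanRange(FileChannel ch, long start, long end) throws IOException {
        List<String> headers = new ArrayList<String>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        ByteBuffer buf = ByteBuffer.allocate(1 << 16);
        // Peek at the byte before 'start' to learn whether 'start' begins a line.
        long pos = (start == 0) ? 0 : start - 1;
        boolean atLineStart = (start == 0);
        boolean inHeader = false;
        while (true) {
            buf.clear();
            if (ch.read(buf, pos) <= 0) break; // positional read; end of file reached
            buf.flip();
            while (buf.hasRemaining()) {
                // Lines starting at or after 'end' belong to the next task.
                if (!inHeader && pos >= end) return headers;
                byte b = buf.get();
                pos++;
                if (inHeader) {
                    if (b == '\n') {
                        headers.add(new String(current.toByteArray(), StandardCharsets.ISO_8859_1));
                        current.reset();
                        inHeader = false;
                    } else {
                        current.write(b);
                    }
                } else if (atLineStart && b == '>') {
                    inHeader = true;
                    current.write(b);
                }
                atLineStart = (b == '\n');
            }
        }
        if (inHeader) { // last header line had no trailing newline
            headers.add(new String(current.toByteArray(), StandardCharsets.ISO_8859_1));
        }
        return headers;
    }

    /** Splits the file into 'threads' byte ranges and scans them concurrently. */
    static List<String> extractHeaders(Path in, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (final FileChannel ch = FileChannel.open(in, StandardOpenOption.READ)) {
            long size = ch.size();
            List<Future<List<String>>> parts = new ArrayList<Future<List<String>>>();
            for (int i = 0; i < threads; i++) {
                final long start = size * i / threads;
                final long end = size * (i + 1) / threads;
                parts.add(pool.submit(new Callable<List<String>>() {
                    public List<String> call() throws IOException {
                        return scanRange(ch, start, end);
                    }
                }));
            }
            List<String> all = new ArrayList<String>();
            for (Future<List<String>> part : parts) {
                all.addAll(part.get()); // joining in submission order preserves file order
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }
}
```

A task that starts in the middle of a line skips ahead to the next newline, and a task whose header line crosses its upper boundary reads past the boundary to finish it, so no header is lost or counted twice. For a 35 GB input the collected headers may not fit comfortably in memory; having each task write to its own temporary file would avoid that.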

3 answers

Answer 1
1 2014-12-22 20:43:03

Answer 2
1 2014-12-22 21:15:42

Answer 3
0 2014-12-22 20:46:36
