Java: How to extract matching lines from a large text file fast?
I know there are plenty of general solutions to this problem, but I am still not satisfied with the runtime they need in my particular case.
Consider a 35 GB text file in FASTA format, like this:
>Protein_1 So nice and cute little fella
MTTKKCLQKFHLESLGKLGDSFLKYAISIQLFKSYENHYEGLPSIKKNKIISNAALFKLG
YARKILRFIRNEPFDLKVGLIPSDNSQAYNFGKEFLMPSVKMCSRVK*
>Protein_2 Fancy incredible description of its function
MADDSKFCFFLVSTFLLLAVVVNVTLAANYVPGDDILLNCGGPDNLPDADGRKWGTDIGS
[…] etc.
I need to extract only the lines that start with >.
Doing this with grep '>' proteins.fasta > protein_descriptions.txt takes only a couple of minutes.
But with Java 7 the following has already been running for over 90 minutes:
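import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public class ExtractHeaders {
    public static void main(String[] args) throws Exception {
        BufferedReader fastaIn = new BufferedReader(new FileReader(args[0]));
        // Every header line is kept in memory until the whole file has been read.
        List<String> l = new ArrayList<String>();
        String str;
        while ((str = fastaIn.readLine()) != null) {
            if (str.startsWith(">")) {
                l.add(str); // List has add(), not append()
            }
        }
        fastaIn.close();
        // …
    }
}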
Does anyone have an idea how to speed this up to grep performance?
Your help will be much appreciated. Cheers!
If you write each matching line to the output file immediately instead of accumulating objects in memory, performance will improve (and it will be closer to what you did with grep anyway).
...
BufferedWriter fastaOut = new BufferedWriter(new FileWriter(args[1]));
...
while ((str = fastaIn.readLine()) != null) {
    if (str.startsWith(">")) {
        fastaOut.write(str);
        fastaOut.newLine();
    }
}
...
fastaOut.close();
BioJava (biojava.org) provides a FASTA reader. For reading huge files, consider using a SeekableByteChannel together with ByteBuffers. The BioJava library uses byte buffers.
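To make the channel and buffer suggestion concrete, here is a minimal sketch that does not use BioJava at all. It reads the file through a FileChannel (which implements SeekableByteChannel) into a reusable ByteBuffer and copies header lines byte by byte, so no String is created per line. The class name FastaHeaderScanner, the method extractHeaders, and the buffer sizes are my own assumptions for illustration. It also assumes lines end in \n. Whether this actually reaches grep speed on a 35 GB file would have to be measured.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of BioJava or the JDK.
class FastaHeaderScanner {

    /** Copies every line that starts with '>' from in to out; returns how many were found. */
    static long extractHeaders(Path in, Path out) throws IOException {
        long count = 0;
        try (FileChannel channel = FileChannel.open(in, StandardOpenOption.READ);
             OutputStream os = new BufferedOutputStream(Files.newOutputStream(out), 1 << 16)) {
            ByteBuffer buf = ByteBuffer.allocate(1 << 20); // 1 MiB, an arbitrary choice
            boolean atLineStart = true; // the next byte is the first byte of a line
            boolean inHeader = false;   // the current line started with '>'
            while (channel.read(buf) != -1) {
                buf.flip();
                while (buf.hasRemaining()) {
                    byte b = buf.get();
                    if (atLineStart) {
                        inHeader = (b == '>');
                        if (inHeader) {
                            count++;
                        }
                    }
                    if (inHeader) {
                        os.write(b); // includes the terminating '\n'
                    }
                    atLineStart = (b == '\n');
                }
                buf.clear();
            }
            if (inHeader && !atLineStart) {
                os.write('\n'); // the file ended in the middle of a header line
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        long n = extractHeaders(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(n + " header lines written");
    }
}
```

It would be called like the original program, with the input and output paths as the two arguments. Because the state flags survive across buffer refills, a header line that straddles two reads is still copied whole.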
You could probably speed this up considerably using multiple threads. If the file is X bytes long and you have n threads, you start each thread at X/n intervals and have it read X/n bytes. Since a chunk boundary will usually fall in the middle of a line, each thread has to work out where the first full line in its range begins and must finish the line it is in when it reaches the end of its range. You will also want to synchronize your ArrayList to ensure your results are added correctly.
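A rough sketch of the chunking idea, under a few assumptions of mine: the names ParallelFastaHeaders, scanChunk and extractHeaders are made up, lines end in \n, and the header lines of one chunk fit in memory. Instead of a synchronized ArrayList, each task here returns its own byte array and the results are written in chunk order. That is a deliberate swap, chosen because it keeps the output in file order and needs no locking. Whether several threads help at all depends on the disk. If the job is I/O bound, they may not.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper illustrating the X/n chunking idea.
class ParallelFastaHeaders {

    /** Returns the header lines whose '>' lies at a line start inside [start, end). */
    static byte[] scanChunk(FileChannel ch, long start, long end) throws IOException {
        ByteArrayOutputStream headers = new ByteArrayOutputStream();
        ByteBuffer buf = ByteBuffer.allocate(1 << 16);
        // Begin one byte early: that byte tells us whether 'start' is a line start.
        long pos = (start == 0) ? 0 : start - 1;
        boolean atLineStart = (start == 0);
        boolean inHeader = false;
        boolean done = false;
        while (!done) {
            buf.clear();
            if (ch.read(buf, pos) <= 0) {
                break; // end of file
            }
            buf.flip();
            while (buf.hasRemaining()) {
                // Past the range: stop unless we are still finishing a header line.
                if (pos >= end && (atLineStart || !inHeader)) {
                    done = true;
                    break;
                }
                byte b = buf.get();
                if (atLineStart) {
                    inHeader = (b == '>');
                }
                if (inHeader) {
                    headers.write(b);
                }
                atLineStart = (b == '\n');
                pos++;
            }
        }
        if (inHeader && !atLineStart) {
            headers.write('\n'); // file ended inside a header line
        }
        return headers.toByteArray();
    }

    static void extractHeaders(Path in, Path out, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (final FileChannel ch = FileChannel.open(in, StandardOpenOption.READ);
             OutputStream os = new BufferedOutputStream(Files.newOutputStream(out))) {
            long size = ch.size();
            long chunk = (size + threads - 1) / threads; // ceil(X / n)
            List<Future<byte[]>> parts = new ArrayList<Future<byte[]>>();
            for (int i = 0; i < threads; i++) {
                final long start = Math.min(size, i * chunk);
                final long end = Math.min(size, start + chunk);
                parts.add(pool.submit(new Callable<byte[]>() {
                    public byte[] call() throws IOException {
                        return scanChunk(ch, start, end); // positional reads are thread-safe
                    }
                }));
            }
            for (Future<byte[]> part : parts) {
                os.write(part.get()); // chunk order keeps the original line order
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        extractHeaders(Paths.get(args[0]), Paths.get(args[1]),
                Runtime.getRuntime().availableProcessors());
    }
}
```

The one-byte lookback in scanChunk is what handles the boundary problem: a header belongs to exactly one chunk (the one containing its >), and that chunk keeps reading past its end until the header's newline.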