Java: How to extract matching lines from a large text file fast?

Although I am aware that there are plenty of solutions to this problem in general, I am still not satisfied with the runtime they require in my particular case.

Consider a 35 GB text file in FASTA format, like this:

>Protein_1 So nice and cute little fella
MTTKKCLQKFHLESLGKLGDSFLKYAISIQLFKSYENHYEGLPSIKKNKIISNAALFKLG 
YARKILRFIRNEPFDLKVGLIPSDNSQAYNFGKEFLMPSVKMCSRVK*
>Protein_2 Fancy incredible description of its function
MADDSKFCFFLVSTFLLLAVVVNVTLAANYVPGDDILLNCGGPDNLPDADGRKWGTDIGS
[…] etc.

I need to extract only the > lines.

Using grep '>' proteins.fasta > protein_descriptions.txt to achieve this takes only a couple of minutes.

But using Java 7, the following has already been running for over 90 minutes:

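import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public static void main(String[] args) throws Exception {
    BufferedReader fastaIn = new BufferedReader(new FileReader(args[0]));
    // Collect every header line (lines starting with '>') in memory.
    List<String> l = new ArrayList<String>();
    String str;
    while ((str = fastaIn.readLine()) != null) {
        if (str.startsWith(">")) {
            l.add(str);
        }
    }
    fastaIn.close();
    // …
}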

Does anyone have an idea of how to speed this up to grep performance?

Your help will be much appreciated. Cheers!

If you write each line to the output file immediately instead of accumulating objects in memory, it will improve performance (and will be more like what you did with grep anyway).

...
BufferedWriter fastaOut = new BufferedWriter(new FileWriter(args[1]));
...
while ((str = fastaIn.readLine()) != null) {
    if (str.startsWith(">")) {
        // Write each header straight to the output file instead of storing it.
        fastaOut.write(str);
        fastaOut.newLine();
    }
}
...
fastaOut.close();
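For reference, the fragment above could be assembled into a complete program along the following lines. This is only a sketch: the class name FastaHeaderExtractor and the method extractHeaders are made up for illustration, and ISO-8859-1 is an assumed charset, chosen because it maps every byte to a character and so never fails on unexpected input.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical class name; not part of the original answer.
class FastaHeaderExtractor {

    /** Copies every line starting with '>' from in to out and returns how many were copied. */
    static long extractHeaders(Path in, Path out) throws IOException {
        long count = 0;
        // try-with-resources (Java 7+) closes both files even if an exception occurs.
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.ISO_8859_1);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(">")) {
                    // Write immediately; nothing is kept in memory.
                    writer.write(line);
                    writer.newLine();
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        long n = extractHeaders(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(n + " header lines written");
    }
}
```

It would be run with the input file as the first argument and the output file as the second, for example java FastaHeaderExtractor proteins.fasta protein_descriptions.txt.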

BioJava (biojava.org) provides a FASTA reader. For reading huge files, you should consider using a SeekableByteChannel together with ByteBuffers. The BioJava library uses byte buffers.
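As a rough illustration of the channel-and-buffer idea (this is not BioJava's implementation, and the class and method names are hypothetical), the file can be scanned byte by byte without creating a String per line. The sketch assumes lines end in '\n' and that a header is any line whose first byte is '>'.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical class name, for illustration only.
class ChannelHeaderExtractor {

    /** Copies the raw bytes of every line starting with '>' to out; returns the header count. */
    static long extractHeaders(Path in, Path out) throws IOException {
        long count = 0;
        ByteBuffer buf = ByteBuffer.allocate(1 << 20); // 1 MiB read buffer (arbitrary size)
        boolean atLineStart = true;  // true if the next byte begins a line
        boolean inHeader = false;    // true while copying a header line
        try (SeekableByteChannel channel = Files.newByteChannel(in, StandardOpenOption.READ);
             OutputStream os = new BufferedOutputStream(Files.newOutputStream(out))) {
            while (channel.read(buf) != -1) {
                buf.flip();
                while (buf.hasRemaining()) {
                    byte b = buf.get();
                    if (atLineStart && b == '>') {
                        inHeader = true;
                        count++;
                    }
                    if (inHeader) {
                        os.write(b); // includes the terminating '\n'
                    }
                    atLineStart = (b == '\n');
                    if (atLineStart) {
                        inHeader = false;
                    }
                }
                buf.clear();
            }
            if (inHeader) {
                os.write('\n'); // the file ended inside a header without a newline
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(extractHeaders(Paths.get(args[0]), Paths.get(args[1])) + " header lines written");
    }
}
```

Whether this beats a BufferedReader in practice depends on the disk and the JVM, so it is worth measuring on the actual file.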

You could probably speed this up considerably using multiple threads. If the file is X bytes long and you have n threads, start each thread at an offset that is a multiple of X/n and have it read X/n bytes. You will want to synchronize your ArrayList to ensure your results are added correctly.
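One possible way to sketch this chunked approach is shown below. The names are hypothetical, and it makes a few assumptions beyond the answer: lines end in '\n', a header belongs to the chunk in which it starts (so a worker may read past the end of its chunk to finish a header), and each worker fills its own list that is merged in chunk order, which avoids the need for a synchronized shared list. If the disk is the bottleneck, extra threads may not help at all.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical class name, for illustration only.
class ParallelHeaderExtractor {

    /** Returns the header lines that start inside [start, end), in file order. */
    static List<String> scanChunk(FileChannel ch, long start, long end) throws IOException {
        List<String> headers = new ArrayList<String>();
        ByteBuffer buf = ByteBuffer.allocate(1 << 16);
        ByteArrayOutputStream current = null;   // non-null while inside a header line
        long pos = Math.max(0, start - 1);      // peek at the byte before the chunk
        boolean atLineStart = (start == 0);
        while (true) {
            buf.clear();
            int n = ch.read(buf, pos);          // positional read; safe to share the channel
            if (n <= 0) break;                  // end of file
            for (int i = 0; i < n; i++, pos++) {
                byte b = buf.get(i);
                if (pos >= start) {
                    if (current == null) {
                        if (pos >= end) return headers;  // past the chunk, no header pending
                        if (atLineStart && b == '>') current = new ByteArrayOutputStream();
                    }
                    if (current != null) {
                        if (b == '\n') {
                            headers.add(new String(current.toByteArray(), StandardCharsets.ISO_8859_1));
                            current = null;
                        } else {
                            current.write(b);
                        }
                    }
                }
                atLineStart = (b == '\n');
            }
        }
        if (current != null) {                  // file ended inside a header
            headers.add(new String(current.toByteArray(), StandardCharsets.ISO_8859_1));
        }
        return headers;
    }

    /** Splits the file into one byte range per thread and merges the results in order. */
    static List<String> extractHeaders(Path in, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FileChannel ch = FileChannel.open(in, StandardOpenOption.READ)) {
            long size = ch.size();
            long chunk = (size + threads - 1) / threads;
            List<Future<List<String>>> parts = new ArrayList<Future<List<String>>>();
            for (int t = 0; t < threads; t++) {
                final long start = Math.min(size, t * chunk);
                final long end = Math.min(size, start + chunk);
                parts.add(pool.submit(new Callable<List<String>>() {
                    @Override
                    public List<String> call() throws IOException {
                        return scanChunk(ch, start, end);
                    }
                }));
            }
            List<String> all = new ArrayList<String>();
            for (Future<List<String>> part : parts) {
                all.addAll(part.get());         // waits for each chunk in file order
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }
}
```

Like the code in the question, this keeps all headers in memory; for a 35 GB input, writing each chunk's headers to its own file and concatenating them afterwards may be the safer variant.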
