Reading ahead with BufferedReader (Java)

I'm writing a parser for files that look like this:

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613

I want to extract the information that follows certain tags (DEFINITION, VERSION, etc.), but some descriptions span multiple lines and I need all of it. This is a problem when reading the file with BufferedReader. I almost got it working with mark() and reset(), but when I ran the program I noticed that it only works for one tag and the other tags are somehow skipped. This is the code I have so far:

Pattern pTag = Pattern.compile("^[A-Z]{2,}"); // regex: 2 or more uppercase letters at the start of a line is a tag

Matcher mTagCurr = pTag.matcher(line);                

if (mTagCurr.find()) {
    reader.mark(1000);

    String nextLine = reader.readLine();
    Matcher mTagNext = pTag.matcher(nextLine);                    
    if (mTagNext.find()){
        reader.reset();
        continue;
    }

    Pattern pWhite = Pattern.compile("^\\s{6,}");
    Matcher mWhite = pWhite.matcher(nextLine);
    while (mWhite.find()) {
        line  = line.concat(nextLine);
    }                    
    System.out.println(line);
}

This piece of code is supposed to find tags and concatenate descriptions that span more than one line. Some answers I found here advised using Scanner, but that is not an option for me: the files I work with can be very large (the largest I have encountered was over 50 GB), and I hope that using BufferedReader will put less strain on my system.

I suggest accumulating the information as you read it, in a single-pass parser. I suspect this will be both simpler and faster in this case.
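A minimal sketch of what such a single-pass parser could look like. The class name `SinglePassTags`, the `Consumer` callback, and the rule that every non-blank, non-tag line is a continuation of the previous tag are my assumptions, not something taken from the question; real GenBank files have sub-keys and feature tables that would need more careful handling.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Pattern;

// Hypothetical helper; names and output format are illustrative only.
final class SinglePassTags {
    // Same tag rule as in the question, compiled once.
    private static final Pattern TAG = Pattern.compile("^[A-Z]{2,}");

    /** Reads each line exactly once and hands a finished entry to sink when the next tag starts. */
    static void parse(BufferedReader reader, Consumer<String> sink) throws IOException {
        StringBuilder current = null;               // entry being accumulated, if any
        String line;
        while ((line = reader.readLine()) != null) {
            if (TAG.matcher(line).find()) {
                if (current != null) sink.accept(current.toString());
                current = new StringBuilder(line);  // start a new entry
            } else if (current != null && !line.trim().isEmpty()) {
                // Assumption: any other non-blank line continues the current entry.
                current.append(' ').append(line.trim());
            }
        }
        if (current != null) sink.accept(current.toString()); // flush the last entry
    }

    /** Convenience wrapper for small inputs; collects the entries into a list. */
    static List<String> parseString(String text) {
        List<String> entries = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new StringReader(text))) {
            parse(r, entries::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }
}
```

Because nothing is ever pushed back, there is no need for mark() or reset(), and only the current entry is held in memory. For a very large file you would pass a callback that processes or writes each entry instead of collecting them in a list.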

BTW, you want to cache your Patterns, as creating them is quite expensive. You may find that you want to avoid using them entirely in some cases.
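To illustrate both points, here is a small sketch with the tag pattern from the question held in a static field, next to a hand-written check that should accept the same lines without any regex. The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class TagCheck {
    // Compiled once when the class is loaded, not once per line.
    private static final Pattern TAG = Pattern.compile("^[A-Z]{2,}");

    /** Regex version: the line starts with at least two uppercase ASCII letters. */
    static boolean isTagRegex(String line) {
        return TAG.matcher(line).find();
    }

    /** Regex-free version of the same rule: only the first two characters matter. */
    static boolean isTagPlain(String line) {
        return line.length() >= 2
                && isUpperAscii(line.charAt(0))
                && isUpperAscii(line.charAt(1));
    }

    private static boolean isUpperAscii(char c) {
        return c >= 'A' && c <= 'Z';
    }
}
```

Whether the plain version is measurably faster depends on the data and the JVM, so it is worth profiling before giving up the readability of the regex.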

The code starts by looking for a continuation line and calls reset() if it does not find one, but the code that reads additional lines does not seem to do that. Could it be reading the start of the next section in the GenBank file and not putting it back? I don't see all of the loop control code here, but what I do see appears to be correct.
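If you want to stay with mark() and reset(), one way to make sure every peeked line is either consumed or put back is to set the mark before each look-ahead read. This is only a sketch under my own assumptions: the class name, the `READ_AHEAD_LIMIT` value, and joining continuation lines with a single space are not from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class LookaheadTags {
    private static final Pattern TAG = Pattern.compile("^[A-Z]{2,}");
    private static final Pattern CONTINUATION = Pattern.compile("^\\s{6,}");
    // Assumed upper bound on the length of one line; reset() may fail beyond it.
    private static final int READ_AHEAD_LIMIT = 8192;

    static void parse(BufferedReader reader, Consumer<String> sink) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            if (!TAG.matcher(line).find()) continue;    // skip lines that do not start an entry
            StringBuilder entry = new StringBuilder(line);
            while (true) {
                reader.mark(READ_AHEAD_LIMIT);          // remember the position before peeking
                String next = reader.readLine();
                if (next == null) break;                // end of input
                if (!CONTINUATION.matcher(next).find()) {
                    reader.reset();                     // not a continuation: put it back
                    break;
                }
                entry.append(' ').append(next.trim());  // continuation: consume it
            }
            sink.accept(entry.toString());
        }
    }

    /** Convenience wrapper for small inputs; collects the entries into a list. */
    static List<String> parseString(String text) {
        List<String> entries = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new StringReader(text))) {
            parse(r, entries::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }
}
```

The inner loop keeps reading until it sees a line that is not a continuation, and that line is always pushed back, so the outer loop gets to test it as a possible tag.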

If all else fails and you need something easy, there's always BioJava (see "How to Read a Genbank File with Biojava3" to find out whether it helps). I have tried to use BioJava for my own projects, but it always falls a little short.

When I have written FASTA and FASTQ parsers, I read into a byte or char buffer and processed it directly. There is more buffer management code to write, but that way I don't have to worry about putting bytes back into a buffer. It also avoids regex, which can be expensive in a time-critical application. Of course, this takes more time to implement.
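Roughly, the buffer-based approach looks like the sketch below. It is not my actual parser code: the class name, the `Consumer` callback, and the choice to split only on '\n' (dropping a trailing '\r') are assumptions made to keep the example short.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper; names are illustrative only.
final class CharBufferLines {

    /** Splits the input on '\n' using a manually managed char buffer. */
    static void forEachLine(Reader in, int bufferSize, Consumer<String> sink) throws IOException {
        char[] buf = new char[bufferSize];
        StringBuilder carry = new StringBuilder();   // partial line that spans buffer refills
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            int start = 0;
            for (int i = 0; i < n; i++) {
                if (buf[i] == '\n') {
                    carry.append(buf, start, i - start);
                    emit(carry, sink);
                    start = i + 1;
                }
            }
            carry.append(buf, start, n - start);     // keep the unfinished tail for the next round
        }
        if (carry.length() > 0) emit(carry, sink);   // last line without a trailing newline
    }

    private static void emit(StringBuilder line, Consumer<String> sink) {
        int len = line.length();
        if (len > 0 && line.charAt(len - 1) == '\r') line.setLength(len - 1); // drop CR of CRLF
        sink.accept(line.toString());
        line.setLength(0);
    }

    /** Convenience wrapper for small inputs; collects the lines into a list. */
    static List<String> lines(String text, int bufferSize) {
        List<String> out = new ArrayList<>();
        try {
            forEachLine(new StringReader(text), bufferSize, out::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

The `carry` builder is the extra buffer management mentioned above: it holds whatever part of a line was left over when the buffer ran out, so nothing ever has to be pushed back into the reader.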

Tip: For the fastest implementation, if you are managing the buffer yourself, check out NIO (Java NIO Tutorial). I have seen it give up to a 10x speedup in some cases (writing data). The only drawback is that I have not found an easy way to read gzipped sequence data with NIO yet.
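As a starting point, reading a file through a FileChannel into a reusable ByteBuffer could look like this. It only counts newline bytes to keep the example small; the class name, the 64 KiB buffer size, and the temporary-file helper are assumptions for the demo, and I make no claim about how much faster it is on your data.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper; names are illustrative only.
final class NioLineCount {

    /** Counts '\n' bytes by draining the file through one reusable direct buffer. */
    static long countNewlines(Path file) throws IOException {
        long count = 0;
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 16);   // 64 KiB, an arbitrary size
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            while (ch.read(buf) != -1) {
                buf.flip();                        // switch from filling to draining
                while (buf.hasRemaining()) {
                    if (buf.get() == '\n') count++;
                }
                buf.clear();                       // make the whole buffer writable again
            }
        }
        return count;
    }

    /** Demo helper: writes the text to a temporary file and counts its newlines. */
    static long countNewlinesOf(String text) {
        try {
            Path tmp = Files.createTempFile("nio-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return countNewlines(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A real parser would replace the counting loop with its own byte-level scanning, carrying any partial line over to the next read in the same way as with a char buffer.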
