简体   繁体   中英

Parse text with uncertain number of fields

I have a file (~50,000 lines) text.txt as below, which contains some gene info from five individuals (AB, BB, CA, DD, GG). The \t in the file is a tab seperator. There are also a lot of info that are not useful in the file, and I would like to clean it up. So What I need is to extract the species name with 'transcript=' id, if they have one, and also extract the 'DD:' and 'GG:' parts.

$head text.txt

GeneA\tAB:xrbyk | jdnif | otherText\tBB:abdf | jdhfkc | otherDifferentText\tCA:bdmf | nfjvks | transcript=aaabb.1\tDD:hudnf.1 type=cds\tGG:jdubf.1 type=cds
GeneB\tBB:dfsfg | dfsfvdf | otherDifferent\tCA:zdcfsdf | xfgdfgs | transcript=sdfs.1\tDD:sdfsw.1 type=cds\tGG:fghfg.1 type=cds
GeneC\tAB:dsfsdf | xdvv | otherText1\tBB:xdsd | sdfsdf | otherDifferentText2\tDD:hudnf.1 type=cds\tGG:jdubf.1 type=cds
GeneD\tAB:dfsdf | Asda | transcript=asdasd.2\tCA:bdmf | nfjvks | transcript=aaabb.1\tDD:hudnf.1 type=cds\tGG:jdubf.1 type=cds

and I would like the output to be

GeneA\tCA:transcript=aaabb.1\tDD:hudnf.1\tGG:jdubf.1
GeneB\tCA:transcript=sdfs.1\tDD:sdfsw.1\tGG:fghfg.1
GeneC\tDD:hudnf.1\tGG:jdubf.1
GeneD\tAB:transcript=asdasd.2\tCA:transcript=aaabb.1\tDD:hudnf.1\tGG:jdubf.1

I have been searching and thinking for a whole afternoon already, and only have the idea of tearing this file apart by species with the first column Gene name, then separately modify each file, and finally merge files together according to the gene name. But as you see, each line of the file does not necessary have the same number of fields, and so I can't simply use awk to print a certain column. I'm not sure how I can tear them up by species.

I tried to mimic the use of this one How to use sed/grep to extract text between two words? , but did not come with success. I also read a bit about Python in how to split text, (as I'm starting to learn this language), but still can't figure it out. Could anyone please help? Thanks a lot!

UPDATE OF CLARIFICATION OF THE INPUT DATA: In the example showed above, the gene info of each individual is separated by tab (\t), which means that all the text after the inidividual name plus colon (eg AB:) belongs to the individual (eg "xrbyk | jdnif | otherText" for AB). Whether to keep the individual in the final output depends on whether there is the information of "transcript=" for that individual, except for DD and GG. This is why in the final output the 1st line start with CA but not with AB.

Assuming those \t in your sample text are real tabs, this Perl one liner will do it. If they're literal \t text then this needs to be tweaked a tad. Put each field you want to grab in the regex alternation after GG: .

perl -lne '@wanted = $_ =~ m{(^Gene[ABCD]|(?:transcript=|DD:|GG:)\S+)}g; print join "\t", @wanted if @wanted; ' inputfile.txt > outputfile.txt


Output:

GeneA   transcript=aaabb.1      DD:hudnf.1      GG:jdubf.1
GeneB   transcript=sdfs.1       DD:sdfsw.1      GG:fghfg.1
GeneC   DD:hudnf.1      GG:jdubf.1
GeneD   transcript=asdasd.2     transcript=aaabb.1      DD:hudnf.1      GG:jdubf.1

This solution is a bit long, but should be easy to work with:

#!/usr/bin/env python3

# main.py

import csv
import fileinput
import re


def filter_fields(row):
    output = []
    for field_number, field in enumerate(row, 1):
        if field_number == 1:
            output.append(field)
        elif "DD:" in field or "GG:" in field:
            output.append(field.split()[0])
        elif "transcript=" in field:
            # Remove stuff from after the colon to the last space
            output.append(re.sub(r":.* ", ":", field))

    return "\t".join(output)


reader = csv.reader(fileinput.input(), delimiter="\t")
for row  in reader:
    print(filter_fields(row))

How to run it:

# Output to the screen
python3 main.py text.txt

# Output to a file
python3 main.py text.txt > out.txt

# Use as a filter
cat text.txt | python3 main.py

Notes

  • In this solution, each line of text is broken into a row of fields.
  • The function filter_fields will take each row, decide what field to to keep and reformat. It then return those fields, tab separated.
  • The re.sub(...) call says: Delete everything after the colon, up to the last space.

The task is trivial and can be solved in one line of code by defining regular expression specifying information of interest.

To test on input data substitute while <DATA> with while <> and give filename as an argument to the script.

use strict;
use warnings;
use feature 'say';

my $re = qr/(Gene.|[A-Z]{2}:(?=[^:]*transcript=)|transcript=\S+|GG:\S+|DD:\S+)/;

say join("\t",/$re/g) while <DATA>;

exit 0;

__DATA__
GeneA   AB:xrbyk | jdnif | otherText    BB:abdf | jdhfkc | otherDifferentText   CA:bdmf | nfjvks | transcript=aaabb.1   DD:hudnf.1 type=cds GG:jdubf.1 type=cds
GeneB   BB:dfsfg | dfsfvdf | otherDifferent CA:zdcfsdf | xfgdfgs | transcript=sdfs.1    DD:sdfsw.1 type=cds GG:fghfg.1 type=cds
GeneC   AB:dsfsdf | xdvv | otherText1   BB:xdsd | sdfsdf | otherDifferentText2  DD:hudnf.1 type=cds GG:jdubf.1 type=cds
GeneD   AB:dfsdf | Asda | transcript=asdasd.2   CA:bdmf | nfjvks | transcript=aaabb.1   DD:hudnf.1 type=cds GG:jdubf.1 type=cds

Output

GeneA   CA:     transcript=aaabb.1      DD:hudnf.1      GG:jdubf.1
GeneB   CA:     transcript=sdfs.1       DD:sdfsw.1      GG:fghfg.1
GeneC   DD:hudnf.1      GG:jdubf.1
GeneD   AB:     transcript=asdasd.2     CA:     transcript=aaabb.1      DD:hudnf.1      GG:jdubf.1

With a small code change the exact desired output can be achieved

use strict;
use warnings;
use feature 'say';

my $re = qr/(Gene.|[A-Z]{2}:transcript=\S+|GG:\S+|DD:\S+)/;

while( <DATA> ) {
    s/[^:]*(?=transcript=)//g;
    say join("\t",/$re/g);
}

exit 0;

__DATA__
GeneA   AB:xrbyk | jdnif | otherText    BB:abdf | jdhfkc | otherDifferentText   CA:bdmf | nfjvks | transcript=aaabb.1   DD:hudnf.1 type=cds GG:jdubf.1 type=cds
GeneB   BB:dfsfg | dfsfvdf | otherDifferent CA:zdcfsdf | xfgdfgs | transcript=sdfs.1    DD:sdfsw.1 type=cds GG:fghfg.1 type=cds
GeneC   AB:dsfsdf | xdvv | otherText1   BB:xdsd | sdfsdf | otherDifferentText2  DD:hudnf.1 type=cds GG:jdubf.1 type=cds
GeneD   AB:dfsdf | Asda | transcript=asdasd.2   CA:bdmf | nfjvks | transcript=aaabb.1   DD:hudnf.1 type=cds GG:jdubf.1 type=cds

Output

GeneA   CA:transcript=aaabb.1   DD:hudnf.1      GG:jdubf.1
GeneB   CA:transcript=sdfs.1    DD:sdfsw.1      GG:fghfg.1
GeneC   DD:hudnf.1      GG:jdubf.1
GeneD   AB:transcript=asdasd.2  CA:transcript=aaabb.1   DD:hudnf.1      GG:jdubf.1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM