
Filtering based on more than one factor in Python

I have a text file with 3 columns and want to filter it based on the 3rd column. The 1st column holds ids and the 3rd column holds a sequence of characters. Each id is repeated in the 1st column, and each repeat has a different sequence, of a different length, in the 3rd column. In some cases there is no sequence, and the field contains "not present" instead. For each id I want to keep only one row: it must have a sequence, and that sequence must be the longest one for the id.

example:

RPL17   ENST00000584364 not present
RPL17   ENST00000579248 CTGCGTTGCTCCGAGGGCCCAATCCTCCTGCCATCGCCGCCATCCTGGCTTCGGGGGCGCCGGCCT
RPL17   ENST00000580210 GCCCGTGTGGCTACTTCTGTGGAAGCAGTGCTGTAGTTACTGGAAGATAAAAGGGAAAGCAAGCCCTTGGTGGGGGAAA
RPL18   ENST00000551749 not present
RPL18   ENST00000546623 not present
RPL18   ENST00000552588 TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC
RPL18   ENST00000547897 ACCTGGCCGAGCAGGAGGCGCCATC
RPL18   ENST00000550645 GCCGAGCAGGAGGCGCCATC
RPL18   ENST00000552705 not present

results:

RPL17   ENST00000580210 GCCCGTGTGGCTACTTCTGTGGAAGCAGTGCTGTAGTTACTGGAAGATAAAAGGGAAAGCAAGCCCTTGGTGGGGGAAA
RPL18   ENST00000552588 TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC

I wrote this code and changed the middle part a couple of times, but it did not do what I want.

with open("file.txt") as f, open('test.txt', 'w') as outfile:
    for line in f:
        line=line.split(",")
           .
           .
           .
           outfile.writerow(entry)

It looks like the input file is in a fixed-width columnar format. So first we have to figure out which columns each field occupies, and then we can use a dict to make sure we keep only the longest sequence for a given ID.

Here is the meat of what I think you are asking for:

# 00000000001111111111222222222233333333334
# 01234567890123456789012345678901234567890
# RPL17   ENST00000584364 not present
from collections import OrderedDict

# Maps each id to (longest sequence so far, original line), in first-seen order
sequences = OrderedDict()
with open("file.txt") as f, open('test.txt', 'w') as outfile:
    for line in f:
        st_id = line[:8].strip()       # columns 0-7: id
        sequence = line[24:].strip()   # column 24 onward: sequence
        # Rows without a sequence can never be the one we keep
        if not sequence or sequence == 'not present':
            continue
        value, _ = sequences.get(st_id, ('', None))
        if len(sequence) > len(value):
            sequences[st_id] = (sequence, line)
    for _, line in sequences.values():
        outfile.write(line)
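
If the columns turn out to be separated by runs of whitespace rather than sitting at fixed offsets (an assumption; the sample can be read either way), the same dict idea can be sketched with str.split. The function name longest_per_id is made up for illustration, and the rows are a subset of the ones in the question.

```python
def longest_per_id(lines):
    """Return {id: original line} holding the longest sequence seen per id."""
    best = {}  # id -> (sequence, line); dicts keep insertion order on Python 3.7+
    for line in lines:
        # maxsplit=2 keeps "not present" together as the third field
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # blank or malformed row
        st_id, _, sequence = parts
        sequence = sequence.strip()
        if sequence == "not present":
            continue  # no sequence, so this row can never win
        if st_id not in best or len(sequence) > len(best[st_id][0]):
            best[st_id] = (sequence, line)
    return {st_id: line for st_id, (_, line) in best.items()}


sample = [
    "RPL17   ENST00000584364 not present",
    "RPL17   ENST00000579248 CTGCGTTGCTCCGAGGGCCCAATCCTCCTGCCATCGCCGCCATCCTGGCTTCGGGGGCGCCGGCCT",
    "RPL17   ENST00000580210 GCCCGTGTGGCTACTTCTGTGGAAGCAGTGCTGTAGTTACTGGAAGATAAAAGGGAAAGCAAGCCCTTGGTGGGGGAAA",
    "RPL18   ENST00000551749 not present",
    "RPL18   ENST00000552588 TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC",
    "RPL18   ENST00000547897 ACCTGGCCGAGCAGGAGGCGCCATC",
]
result = longest_per_id(sample)
for line in result.values():
    print(line)
```

An id whose rows are all "not present" simply never enters the dict, so it produces no output line.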

from collections import defaultdict

d = defaultdict(list)
with open('you_data.txt') as f, open('out.txt', 'w') as out:
    # Split each non-blank line into the id and the rest; assumes three spaces after the id, as in the sample
    s_line = [line.rstrip('\n').split('   ', 1) for line in f if line.strip()]
    for k, v in s_line:
        v = v.strip()
        # Skip rows that carry no sequence
        if not v.endswith('not present'):
            d[k].append(v)
# d is now e.g. {'RPL18': ['ENST00000552588 TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC', 'ENST00000547897 ACCTGGCCGAGCAGGAGGCGCCATC', 'ENST00000550645 GCCGAGCAGGAGGCGCCATC'], ...}
    for k, v in d.items():
        # The transcript ids all have the same length, so the longest entry has the longest sequence
        long_v = max(v, key=len)
        out.write('   '.join([k, long_v]) + '\n')

out:

RPL18   ENST00000552588 TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC
RPL17   ENST00000580210 GCCCGTGTGGCTACTTCTGTGGAAGCAGTGCTGTAGTTACTGGAAGATAAAAGGGAAAGCAAGCCCTTGGTGGGGGAAA


I'm pretty sure this is what you want, although I'm sure it could be cleaned up a bit. max, keyed on the third field via itemgetter, returns the row with the longest sequence, and since this is done for every id, it should be exactly what you want. It is also likely faster than sorting, since max makes a single pass over the rows.
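
To make the role of the key concrete: max compares whatever the key returns, so the key has to produce the length of the sequence rather than the sequence itself, otherwise the comparison is alphabetical. A minimal sketch, with made-up ids and sequences chosen so that the two orderings disagree:

```python
# Hypothetical rows: [id, transcript, sequence]
rows = [
    ["X1", "t1", "TT"],
    ["X1", "t2", "AAAA"],
]

# Keying on the string itself compares alphabetically: "TT" > "AAAA"
alphabetical = max(rows, key=lambda row: row[2])
# Keying on len() compares sequence lengths, which is what the question asks for
longest = max(rows, key=lambda row: len(row[2]))

print(alphabetical[2])
print(longest[2])
```

On the rows shown in the question the two orderings happen to pick the same lines, so the difference only shows up on other data.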

I used a comma as the separator, since your code splits on commas, although the data you showed us is separated by spaces; you can change that to whatever your separator is. The output is comma-separated as well, and you can likewise change that to whatever your output separator should be.

UPDATE: The previous final line didn't actually set up the row properly, and I didn't reset lines to be empty after writing a row, so it would not have worked properly. Also, to avoid repeating code, I put the important line that does what you need into a function ( make_row ).

I have tested this with commas separating the data, and it works perfectly.

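from operator import itemgetter
import csv

get_sequence = itemgetter(2)  # the sequence is the third field of a row


def make_row(lines):
    # Compare by sequence length; comparing the strings themselves would be alphabetical
    longest = max(lines, key=lambda row: len(get_sequence(row).strip()))
    return [field.strip() for field in longest]

# newline='' lets the csv module control the line endings
with open("file.txt") as f, open('test.txt', 'w', newline='') as outfile:
    output = csv.writer(outfile)
    current_id = ''
    lines = []
    for line in f:
        current_line = line.split(",")
        # A new id starts: write the best row collected for the previous id
        if current_line[0] != current_id and lines:
            output.writerow(make_row(lines))
            lines = []
        current_id = current_line[0]
        if current_line[2].strip() != 'not present':
            lines.append(current_line)
    if lines:  # to catch the last row, unless the last id had no sequence at all
        output.writerow(make_row(lines))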
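
The same run-by-run grouping can also be sketched with itertools.groupby, which collects consecutive rows that share an id. This is only a sketch under two assumptions: the input is comma-separated, and all rows for an id sit next to each other (groupby does not sort). The name longest_rows and the in-memory data are illustrative, not part of the original code.

```python
import csv
import io
from itertools import groupby


def longest_rows(rows):
    """Yield, for each run of equal ids, the row with the longest sequence."""
    for _, group in groupby(rows, key=lambda row: row[0]):
        # Drop rows that carry no sequence before comparing lengths
        present = [row for row in group if row[2].strip() != "not present"]
        if present:
            yield max(present, key=lambda row: len(row[2].strip()))


# In-memory stand-in for the comma-separated input file
data = io.StringIO(
    "RPL17,ENST00000584364,not present\n"
    "RPL17,ENST00000579248,CTGCGTTGCTCCGAGGGCCCAATCCTCCTGCCATCGCCGCCATCCTGGCTTCGGGGGCGCCGGCCT\n"
    "RPL17,ENST00000580210,GCCCGTGTGGCTACTTCTGTGGAAGCAGTGCTGTAGTTACTGGAAGATAAAAGGGAAAGCAAGCCCTTGGTGGGGGAAA\n"
    "RPL18,ENST00000551749,not present\n"
    "RPL18,ENST00000552588,TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC\n"
    "RPL18,ENST00000550645,GCCGAGCAGGAGGCGCCATC\n"
    "RPL18,ENST00000552705,not present\n"
)
picked = list(longest_rows(csv.reader(data)))
for row in picked:
    print(",".join(row))
```

With a real file, passing csv.reader(f) for an open file f in place of the StringIO object should behave the same way.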


 