Removing elements from a list of genes

Question

I have a list like this:

['>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding', 
 '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding']

I want to make a new list with the same length and order, but keeping only the gene ID from each element. The result would look like this:

['ENSG00000103091', 'ENSG00000196313']

I am using Python. Does anyone know how to do that? Thanks.

Answer 1

Just use a basic list comprehension:

lst = ['>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding', '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding']

res = [el[5:] for s in lst for el in s.split() if el.startswith('gene:')]

If you prefer to do this using regular for-loops instead, use this:

lst = ['>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding', '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding']

res = []
for s in lst: # for each string in your list
    fields = s.split() # split it on whitespace into a list of fields
    for el in fields: # for each field in that list
        if el.startswith('gene:'): # a field starting with 'gene:' is a match
            res.append(el[5:]) # drop the 5-character 'gene:' prefix and append the rest
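
One thing to be aware of: both versions above skip any string that has no 'gene:' field, so the result can end up shorter than the input. If the output has to stay the same length as the input, a regular expression is another option. This is only a sketch of a different technique, not what the code above does; the names GENE_FIELD and gene_ids_regex are made up for illustration, and I am assuming a missing field should become None.

```python
import re

# Matches a whitespace-delimited field that starts with "gene:" and
# captures the ID that follows it (assumed pattern, adjust as needed).
GENE_FIELD = re.compile(r'(?:^|\s)gene:(\S+)')

def gene_ids_regex(headers):
    """Return one entry per header: the gene ID, or None if no gene field is found."""
    ids = []
    for header in headers:
        match = GENE_FIELD.search(header)
        ids.append(match.group(1) if match else None)
    return ids

lst = [
    '>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding',
    '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding',
]
print(gene_ids_regex(lst))
```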

Answer 2

For each string in the list:
    Split the string on spaces (Python's str.split method)
    Find the element starting with "gene:"
    Keep the rest of the string (grab the slice [5:] of that element)

Do you have enough basic Python knowledge to take it from there? If not, I suggest that you consult the string method documentation.
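
For reference, the steps above can be sketched as a small helper. The function name extract_gene_id is hypothetical, and returning None when no 'gene:' element is present is my assumption; the steps themselves do not say what should happen in that case.

```python
def extract_gene_id(header):
    """Return the text after 'gene:' in one header string, or None if absent."""
    prefix = 'gene:'
    for field in header.split():        # split the string on whitespace
        if field.startswith(prefix):    # find the element starting with "gene:"
            return field[len(prefix):]  # keep the rest (same as the [5:] slice)
    return None                         # assumption: no gene field means None

headers = [
    '>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding',
    '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding',
]
gene_ids = [extract_gene_id(h) for h in headers]
print(gene_ids)
```

Using len(prefix) instead of a hard-coded 5 keeps the slice correct if the prefix ever changes.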

Answer 3

This is by no means the most Pythonic way to achieve this, but it should do what you want.

l = [
    '>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding',
    '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding'
]
genes = []
for e in l:
    parts = e.split('gene:') # parts[1] is everything after 'gene:'
    gene = ''
    for c in parts[1]: # copy characters until the first space
        if c != ' ':
            gene += c
        else:
            break
    genes.append(gene)

print(genes)

Loop through the elements in the list and split each one on 'gene:'. Then copy the characters that follow, up to the first space, into a string, and append that string to the result list.
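
The same idea can probably be written more compactly with two splits instead of a character loop. This is a rough sketch, assuming every string contains 'gene:' (otherwise the [1] index raises an IndexError); the name gene_after_marker is just illustrative.

```python
def gene_after_marker(header, marker='gene:'):
    # Take everything after the first marker, then cut at the first space.
    # Assumes the marker is present in the string.
    return header.split(marker, 1)[1].split(' ', 1)[0]

l = [
    '>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding',
    '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding',
]
genes = [gene_after_marker(e) for e in l]
print(genes)
```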

3 answers

solution1
1 2016-10-12 22:49:58

solution2
0 2016-10-12 22:36:46

solution3
0 2016-10-12 22:48:21
