Printing all occurrences in a string starting with x and ending with y

Question

I apologize if this has been asked before but I didn't even know what to search for.

I've just started learning Python and I'm currently writing my first program. The idea is to identify all open reading frames in a protein sequence. For non-biologists, this means identifying all occurrences of "M...*" in a string.

This is what I have so far and it almost works but prints repeats for every n rather than jumping to the next "M...".

# calculates amino acid sequence from nucleotide sequence
protein = nucleotide_seq.transcribe().translate()
print("5'3' Frame 1: \n" + protein)

# Calculates all open reading frames in protein sequence
for n in range(len(protein)):
    met = protein.find("M", n)
    stop = protein.find("*", met)
    orf = protein[met:stop]
    print("Open reading frame starting at residue " + str(met+1) + " : " + orf)
    nextmet = protein.find("M", stop)
    n += nextmet

Example protein:

DIMGYF*GLTGSR*VLSSGWIRAQSCTECG*SSEAGVEVRGVRQTDRHSQPARSAV*SELQILFSFHLLSNCPELAPVAPGLVFRECPESLVSSRPREESPAAQALLTAAESSGTHAPAGGSRRAAAAAKNFPGWEDRRQVAESRSQLLQAFPAS*ASPRR*RPEGGGEPRKRRRTCAQLRSHRLLNLGEREPRLPGAPSP*QRRRGQVVGVRAAKTRRRPATAGSALIRSAGRAAALGSEFACGLRGTAAHEERSVSDRDFSKPGSARESTSKSAGGILINPALPGASW*GGRSGDDSQRVRALLEKLSLSKAPGGAGVPRLPQPCCGPETCARSPN*PHVK*RTVL*LQRWKRPSMTMPSTPRSSRPRADLMATVTPRS*

Answer 1

n += nextmet doesn't do what you want because when control goes back to the top of the for loop, n is reset to the next number in the range. So instead of a for loop you could use a while loop, e.g.:

maxloop = len(protein)
n = 0
while n < maxloop:
    met = protein.find("M", n)
    if met == -1:
        break
    # etc.: find stop and print the frame, as in your code
    n = stop + 1

I put that if statement in there because find returns -1 when it fails to find its target.
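To make the two behaviours above concrete, here is a small sketch on made-up values (the list name seen is only for illustration). It shows that reassigning n inside a for loop has no lasting effect, and that find signals a miss with -1 rather than an exception.

```python
# Reassigning the loop variable inside a for loop does not skip iterations:
# range() hands out the next value regardless of what the body did to n.
seen = []
for n in range(5):
    seen.append(n)
    n += 10  # overwritten at the start of the next iteration

print(seen)  # [0, 1, 2, 3, 4]

# str.find returns -1 when the target is absent from the searched region.
print("DIMGYF*".find("M"))     # 2
print("DIMGYF*".find("M", 3))  # -1
```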

Here's a more complete demo, now that you've given us some data to work with.

protein = '''DIMGYF*GLTGSR*VLSSGWIRAQSCTECG*SSEAGVEVRGVRQTDRHSQPARSAV*
SELQILFSFHLLSNCPELAPVAPGLVFRECPESLVSSRPREESPAAQALLTAAESSGTHAPAGGSRRAAAAA
KNFPGWEDRRQVAESRSQLLQAFPAS*ASPRR*RPEGGGEPRKRRRTCAQLRSHRLLNLGEREPRLPGAPSP
*QRRRGQVVGVRAAKTRRRPATAGSALIRSAGRAAALGSEFACGLRGTAAHEERSVSDRDFSKPGSARESTS
KSAGGILINPALPGASW*GGRSGDDSQRVRALLEKLSLSKAPGGAGVPRLPQPCCGPETCARSPN*PHVK*
RTVL*LQRWKRPSMTMPSTPRSSRPRADLMATVTPRS*'''

#Get rid of newlines
protein = protein.replace('\n', '')

print("5'3' Frame 1:\n{0}\n".format(protein))

maxloop = len(protein)
n = 0
while n < maxloop:
    met = protein.find("M", n)
    if met == -1:
        break

    stop = protein.find("*", met)
    if stop == -1:
        print('Error: no * found for frame starting at residue', met + 1)
        break

    orf = protein[met:stop]
    print("Open reading frame starting at residue", met + 1, ":", orf)

    n = stop + 1

output

 5'3' Frame 1:
DIMGYF*GLTGSR*VLSSGWIRAQSCTECG*SSEAGVEVRGVRQTDRHSQPARSAV*SELQILFSFHLLSNCPELAPVAPGLVFRECPESLVSSRPREESPAAQALLTAAESSGTHAPAGGSRRAAAAAKNFPGWEDRRQVAESRSQLLQAFPAS*ASPRR*RPEGGGEPRKRRRTCAQLRSHRLLNLGEREPRLPGAPSP*QRRRGQVVGVRAAKTRRRPATAGSALIRSAGRAAALGSEFACGLRGTAAHEERSVSDRDFSKPGSARESTSKSAGGILINPALPGASW*GGRSGDDSQRVRALLEKLSLSKAPGGAGVPRLPQPCCGPETCARSPN*PHVK*RTVL*LQRWKRPSMTMPSTPRSSRPRADLMATVTPRS*

Open reading frame starting at residue 3 : MGYF
Open reading frame starting at residue 358 : MTMPSTPRSSRPRADLMATVTPRS
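If you would rather collect the frames than print them, the same loop could be wrapped in a generator. This is only a sketch: find_orfs is a hypothetical helper name, and unlike the demo above it silently stops when a start residue has no stop after it instead of printing an error.

```python
def find_orfs(protein):
    """Yield (residue_number, orf) pairs, scanning left to right.

    find_orfs is a hypothetical helper, not part of any library.
    residue_number is 1-based; orf excludes the trailing '*'.
    """
    n = 0
    while True:
        met = protein.find("M", n)
        if met == -1:
            return  # no further start residue
        stop = protein.find("*", met)
        if stop == -1:
            return  # start residue with no stop after it
        yield met + 1, protein[met:stop]
        n = stop + 1  # resume after this frame's stop


# Made-up toy sequence for illustration
for residue, orf in find_orfs("DIMGYF*GLTMKK*AAMKK"):
    print(residue, orf)
# 3 MGYF
# 11 MKK
```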

Answer 2

import re
protein = "DIMGYF*GLTGSR*VLSSGWIRAQSCTECG*SSEAGVEVRGVRQTDRHSQPARSAV*SELQILFSFHLLSNCPELAPVAPGLVFRECPESLVSSRPREESPAAQALLTAAESSGTHAPAGGSRRAAAAAKNFPGWEDRRQVAESRSQLLQAFPAS*ASPRR*RPEGGGEPRKRRRTCAQLRSHRLLNLGEREPRLPGAPSP*QRRRGQVVGVRAAKTRRRPATAGSALIRSAGRAAALGSEFACGLRGTAAHEERSVSDRDFSKPGSARESTSKSAGGILINPALPGASW*GGRSGDDSQRVRALLEKLSLSKAPGGAGVPRLPQPCCGPETCARSPN*PHVK*RTVL*LQRWKRPSMTMPSTPRSSRPRADLMATVTPRS*"
for match in re.finditer(r'M([^*]+)\*', protein):
    print(match.start() + 1, match.group())



>3 MGYF*
>358 MTMPSTPRSSRPRADLMATVTPRS*

If M...M..* is not a valid result, you can add M to the prohibited characters: M([^*M]+)\* .

>3 MGYF*
>374 MATVTPRS*
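The difference between the two patterns is easier to see on a short string. The sketch below uses a made-up toy sequence and illustrative variable names. With M allowed inside the frame, the match starts at the first M; with M prohibited, it starts at the last M before the stop.

```python
import re

toy = "AAMKMLL*QQ"  # made-up sequence for illustration

# Any residues except '*' between the start M and the stop.
first_start = [(m.start() + 1, m.group()) for m in re.finditer(r"M[^*]+\*", toy)]

# Also forbid M inside the frame, so the match begins at the last M before '*'.
last_start = [(m.start() + 1, m.group()) for m in re.finditer(r"M[^*M]+\*", toy)]

print(first_start)  # [(3, 'MKMLL*')]
print(last_start)   # [(5, 'MLL*')]
```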

Answer 3

You get repeats because the for loop advances n by 1 on every iteration instead of moving n past the end of the previous frame:

# Calculates all open reading frames in protein sequence
n = 0
length = len(protein)
while n < length:
    met = protein.find("M", n)
    if met == -1:  # No further start residue
        break
    stop = protein.find("*", met)
    if stop == -1:  # No stop after this start residue
        break
    orf = protein[met:stop]
    print("Open reading frame starting at residue " + str(met+1) + " : " + orf)
    n = stop + 1
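One case worth checking in a loop like this is a stretch of sequence with no M left in it. The sketch below, on a made-up string, shows why a -1 from the first find should not be passed on unchecked: find treats a negative start as an offset from the end, so the second find can still succeed and the slice comes out empty.

```python
protein = "GLTGSR*"  # toy sequence with a stop but no M

met = protein.find("M")        # -1: no start residue
stop = protein.find("*", met)  # start=-1 means "begin at the last character"

print(met, stop)                # -1 6
print(repr(protein[met:stop]))  # ''
```

Without a check on met, the loop would report an empty frame "starting at residue 0".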

3 answers

solution1
0 ACCEPTED 2015-11-16 15:28:36

solution2
0 2015-11-16 15:35:47

solution3
0 2015-11-16 15:44:22
