
How to read only a particular part of a string or a sub-string

The goal of this project is to open and read a DNA sequence from a text file and find the longest run of consecutive repeats of a sub-string. For example, if the sub-string is AGATC and the next sub-string is also AGATC, we add to the counter. Once the next sub-string is no longer AGATC, the aim is to compare the counter with the highest score so far, clear the counter, and continue searching so as to find the longest consecutive sequence.

        str_count = []
        counter = 0
        highest = 0
        # read sequence
    
        with open(argv[2], "r") as seq:
            seqRead = seq.read()
            for i in range(len(seqRead)):
                #search for consecutive AGATC
                if i == 'A' and seqRead[i:i+6] == 'AGATC':
                    while i == 'A' and seqRead[i:i+6] == 'AGATC':
                        counter += 1
                        i = i + 5
                if highest < counter:
                    highest = counter
                    counter = 0
                else:
                    counter = 0

Right now, I think the problem is that I am not comparing the text sequence correctly, and thus not reading the correct sequence of letters in the string.

My aim is to track 'i' as an 'A', then extract the next 4 letters and compare the result to 'AGATC'. If it matches, increase the counter and move 'i' to the letter following the compared sub-string; if that letter is an 'A', repeat until the match is no longer consecutive, then update highest, and carry on until reaching the end. This is the aim at least. However, when running the debugger I notice that it never enters the first if statement, which leads me to believe the way I am comparing is incorrect.

Sample input:

AGATCAGATCAGATCAGATCAGATCDJFDHFDTTTTCCSSDDSDDGFJFHAGATCAGATCAGATCAGATCAGATCAGATGJFHJGHJDSHGDKFSAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCDKFDKDFKGJKDFKAGATCkFGJKFDDAGATCDFKJKFJFKDJKAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCKFDHDFKFDHKGHKDFGJFKHDFK

Expected output: highest = 30

This is because the longest run of consecutive AGATC repeats has length 30.

input:

AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG

output: highest = 4

Am I mistaken about how to use seqRead[i:i+6]?

And how could I go about doing this better?

Your substring is too long: seqRead[i:i+6] gives a string of 6 characters rather than 5. That line (and the other line that makes a similar comparison) should use seqRead[i:i+5] instead. Also, you were comparing your loop index ( i ) to a letter, when I think you meant to compare the letter at that position in seqRead. i == 'A' should be changed to seqRead[i] == 'A' :

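    from sys import argv

    str_count = []
    counter = 0
    highest = 0

    # read sequence
    with open(argv[2], "r") as seq:
        seqRead = seq.read()
        for i in range(len(seqRead)):
            # search for consecutive AGATC
            if seqRead[i] == 'A' and seqRead[i:i+5] == 'AGATC':
                # the bounds check keeps seqRead[i] from raising IndexError
                # when a run ends exactly at the end of the file
                while i < len(seqRead) and seqRead[i] == 'A' and seqRead[i:i+5] == 'AGATC':
                    counter += 1
                    i = i + 5
            # keep the longest run seen so far, then reset for the next position
            if highest < counter:
                highest = counter
                counter = 0
            else:
                counter = 0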
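
The two fixes above can be checked in isolation. This is only a minimal sketch; the `seq_read` string is a made-up stand-in for the file contents, not the asker's data:

```python
seq_read = "AGATCAGATCTT"  # hypothetical sample, not taken from the question
i = 0

# Comparing the integer index itself to a letter is always False,
# which is why the original `if` was never entered.
index_is_letter = (i == 'A')

# Comparing the character stored at that index works as intended.
char_is_letter = (seq_read[i] == 'A')

# A slice [i:i+n] holds at most n characters, so a 5-letter pattern needs i+5.
six_chars = seq_read[i:i+6]
five_chars = seq_read[i:i+5]

print(index_is_letter, char_is_letter)
print(six_chars, five_chars)
```

Because the 6-character slice can never equal the 5-character string 'AGATC', the original comparison would have failed even with the index check corrected.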

In your code, the if before the while loop is redundant, and you're slicing an incorrect substring. Here is the updated and simplified code:

    for i in range(len(seqRead)):
        while seqRead[i:i+5] == "AGATC":
            counter += 1
            i += 5
        if counter > highest:
            highest = counter
        counter = 0
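
As for doing this better, one possible sketch is to wrap the same idea in a function. The name `longest_run` and its parameters are hypothetical, chosen here for illustration. It uses a separate `pos` variable for walking along a run, since reassigning the loop variable of a `for` over `range` does not change where the next iteration starts:

```python
def longest_run(sequence, pattern):
    """Return the largest number of back-to-back repeats of pattern in sequence."""
    if not pattern:
        return 0  # an empty pattern would match forever without advancing
    step = len(pattern)
    highest = 0
    for start in range(len(sequence)):
        counter = 0
        pos = start
        # Slicing past the end yields a shorter string, so this never raises.
        while sequence[pos:pos + step] == pattern:
            counter += 1
            pos += step
        highest = max(highest, counter)
    return highest


# Hypothetical sample: a run of 3, a break, then a run of 5.
sample = "AGATC" * 3 + "TT" + "AGATC" * 5 + "G"
print(longest_run(sample, "AGATC"))  # prints 5
```

Trying every start position re-counts parts of a run that was already seen, which costs some time on long inputs, but it stays correct even for patterns that overlap with themselves.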
