
Counting how many times a word occurs consecutively in a string with Python (PSET6 CS50)

My goal is to read some strings (parts of DNA in this context) from a CSV file and then search a separate text file for how many times each of those strings occurs consecutively in its contents. However, my current code creates an infinite loop (I wrote it with while True because I could not come up with a proper condition for the while loop). Any help is appreciated, thanks.

My idea was: search for the target string; if it is found, search for it doubled; if that is found too, search for it tripled, and keep incrementing the count until the repeated string is no longer in the text that was read.

#Header line of csv : name,AGATC,AATG,TATC
# so checkstr = [AGATC,AATG,TATC] 
#Example of searched strings `GCTAAATTTGTTCAGCCAGATGTAGGCTTACAAATCAAGCTGTCCGCTCGGCACGGCCTACACACGTCGTGTAACTACAACAGCTAGTTAATCTGGATATCACCATGACCGAATCATAGATTTCGCCTTAAGGAGCTTTACCATGGCTTGGGATCCAATACTAAGGGCTCGACCTAGGCGAATGAGTTTCAGGTTGGCAATCAGCAACGCTCGCCATCCGGACGACGGCTTACAGTTAGTAGCATAGTACGCGATTTTCGGGAAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGTATCTATCTATCTATCTATCT`

For example, the code should be able to find how many times AGATC occurs consecutively in that string and either return that count or store it in memory.

checkstr = [] #global array that tells us what str to read
def readtxt(csvfile,seq):
    with open(f'{csvfile}','r') as p:#finding which str to read from header line of the csv
        header = csv.reader(p)
        for row in header:
            checkstr = row[1:]
            break
    with open(f'{seq}','r') as f:#searching the text for strs
        readed = f.read()
        for j in checkstr:
            n = 1
            jnew = n * j
            while True:
                if jnew in readed:
                    n += 1
                    print(f"{jnew} and {n}")
                    break
                else:
                    break

This approach relies on the fact that splitting a string by a substring yields an empty string between each pair of consecutive occurrences of that substring. For example:

s = 'abbcd'
s.split('b')
['a', '', 'cd']

In this case the number of consecutive b in abbcd is the count of empty strings plus 1 (2 in this case).
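The observation above can be checked directly. The following is only a small sketch: `empties_plus_one` is a hypothetical helper name, not part of the original answer, and it counts every empty string in the split result rather than one particular run. It also shows a caveat worth keeping in mind: when the run touches the start or end of the text, `split` produces one extra empty string, so the count comes out one too high.

```python
def empties_plus_one(text: str, sub: str) -> int:
    """Count the '' entries produced by text.split(sub) and add one."""
    return text.split(sub).count('') + 1


print('abbcd'.split('b'))               # ['a', '', 'cd']
print(empties_plus_one('abbcd', 'b'))   # 2, the true run length

# Caveat: a run at the very start (or end) of the text adds an extra ''.
print('bbcd'.split('b'))                # ['', '', 'cd']
print(empties_plus_one('bbcd', 'b'))    # 3, although 'b' repeats only twice
```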

Expanding on that, we can use itertools.groupby to group identical neighbouring items in the split list. Each run of '' items corresponds to a run of consecutive matches, so the length of that run plus one gives your answer. The code below reports the first such run it finds. The try/except statement handles the case where the split list contains no '' at all (for example, when your substring is not in the string), so the resulting list of counts is empty.

from itertools import groupby

checkstr = ['AGATC', 'AATG', 'TATC']
s = 'GCTAAATTTGTTCAGCCAGATGTAGGCTTACAAATCAAGCTGTCCGCTCGGCACGGCCTACACACGTCGTGTAACTACAACAGCTAGTTAATCTGGATATCACCATGACCGAATCATAGATTTCGCCTTAAGGAGCTTTACCATGGCTTGGGATCCAATACTAAGGGCTCGACCTAGGCGAATGAGTTTCAGGTTGGCAATCAGCAACGCTCGCCATCCGGACGACGGCTTACAGTTAGTAGCATAGTACGCGATTTTCGGGAAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGTATCTATCTATCTATCTATCT'
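for c in checkstr:
    # Consecutive occurrences of c leave '' entries in the split result;
    # groupby collects each run of equal neighbouring entries.
    groups = groupby(s.split(c))
    try:
        # Length of the first run of '' entries, plus one.
        print(c,[sum(1 for _ in group)+1 for label, group in groups if label==''][0])
    except IndexError:
        # No '' entries at all, so c never appears twice in a row.
        print(c,0)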

Output

AGATC 0
AATG 43
TATC 5
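The idea described in the question, growing the repeated pattern until it no longer appears in the text, can also be made to work, and it does not depend on where in the string the run sits. The following is a minimal sketch under stated assumptions: `longest_run` is a hypothetical helper name and `sample` is a short made-up sequence, neither of which comes from the original post.

```python
def longest_run(sequence: str, unit: str) -> int:
    """Return the largest n for which unit * n occurs in sequence."""
    if not unit:
        return 0  # an empty pattern would otherwise match forever
    n = 0
    # Try one more repeat each time; stop when the longer pattern is absent.
    while unit * (n + 1) in sequence:
        n += 1
    return n


checkstr = ['AGATC', 'AATG', 'TATC']
sample = 'TTAATGAATGAATGCCTATCTATCGG'  # short made-up sequence for illustration
for unit in checkstr:
    print(unit, longest_run(sample, unit))
```

The loop ends on its own because the repeated pattern eventually becomes longer than the text, which gives the while loop the stopping condition the question was missing. This version returns 1 for a single isolated occurrence and 0 when the pattern is absent.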
