cs50 PSET6/DNA Regular Expressions

I'm trying to find the number of consecutive repeats of each STR (a short substring pattern, e.g. "AGAT") in a sequence file.

String Patterns: AGATC,TTTTTTCT,AATG,TCTAG,GATA,TATC,GAAA,TCTG

Sequence file(one of many other sequence files): AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG

In the above sequence, TATC has the longest run, with 5 consecutive TATC repeats. My regular expression returns matches whether they are consecutive or not.

I believe using regular expressions is my best bet. This is my first time working in Python, so don't expect too much. I've used the regex tool at regex101.com and it has given me some good insight into building regexes. I'm passing a variable into the regex with {head}, which is the string pattern, but I want to match {head} repeated 2 or more times. My regex below matches head 1 or more times, so I understand why it returns what it does.

groups = re.findall(rf'(?:{head})+', text)

If I use r"(AGAT){2,}" in regex101.com, it works the way I expect: it finds the string repeated 2 or more times. If I put it into my code as groups = re.findall(rf'(?:{head}){2,}', text), it doesn't return anything.

My code is below:

import csv
import re
import string
import sys


if len(sys.argv) != 3:
    print("missing command-line argument")
    exit(1)

if re.search(r"\.csv$", sys.argv[1]) is None:
    print("CSV file not found!")
    print("Usage: 'python.py *.csv *.txt'")
    exit(1)

if re.search(r"\.txt$", sys.argv[2]) is None:
    print("TXT file not found!")
    print("Usage: 'python.py *.csv *.txt'")
    exit(1)

# use reader or DictReader from the CSV module
# use sys.argv for command-line arguments
# use open(filename) and f.read() to read its contents.

# open CSV and DNA sequence and read into memory
with open(sys.argv[1], newline='') as database, open(sys.argv[2], newline='') as sequence:
    reader = csv.DictReader(database)
    headers = reader.fieldnames
    text = sequence.read()
    for head in headers:
        groups = re.findall(rf'(?:{head})+', text)
        print(head, groups)

If I use the above groups = re.findall(rf'(?:{head})+', text) variable I get the below output

AGATC ['AGATCAGATCAGATCAGATC']
TTTTTTCT []
AATG ['AATG']
TCTAG []
GATA ['GATA', 'GATA']
TATC ['TATCTATCTATCTATCTATC']
GAAA ['GAAA', 'GAAA', 'GAAA']
TCTG []

If I use groups = re.findall(rf'(?:{head}){2,}', text) I get nothing.

AGATC []
TTTTTTCT []
AATG []
TCTAG []
GATA []
TATC []
GAAA []
TCTG []

So, I suppose I'm asking: how can I use a regex to find a string of characters (passed as a variable) repeated 2 or more times?

You can use the pattern ((your pattern)\2*) in your regular expression to find the longest consecutive run of a pattern (regex101 demo for pattern TATC):

import re

seq = 'AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG'
patterns = ['AGATC','TTTTTTCT','AATG','TCTAG','GATA','TATC','GAAA','TCTG']

m = max([x for p in patterns for x in re.findall(r'(({})\2*)'.format(p), seq)], key=lambda k: len(k[0]) // len(k[1]))
print('Most repeated pattern: {}, number of repetitions {}'.format(m[1], len(m[0]) // len(m[1])))

Prints:

Most repeated pattern: TATC, number of repetitions 5
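To make the one-liner above easier to follow, here is a minimal sketch of what re.findall returns for a single pattern. Because the regex has two capture groups, each result is a (whole_run, unit) tuple, and dividing the lengths gives the repeat count. The text variable is a short excerpt of the question's sequence, and the names matches and runs are only illustrative.

```python
import re

# Short excerpt of the sequence from the question (illustrative only).
text = "GGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAG"

# Group 1 is the whole run, group 2 is the repeated unit;
# \2* matches zero or more further copies of group 2.
matches = re.findall(r'((TATC)\2*)', text)
print(matches)  # [('TATCTATCTATCTATCTATC', 'TATC')]

# Repeat count = length of the run divided by length of one unit.
runs = [(unit, len(whole) // len(unit)) for whole, unit in matches]
print(runs)  # [('TATC', 5)]
```

The max(...) call in the answer applies the same length division as its key, across every pattern at once.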

This answer was given by a user, yeahIProgram, on Reddit's cs50 subreddit.

"That's what I was referring to, but I had to look it up and you escape the braces inside the formatted string by doubling them."

So, the regular expression I was looking for was groups = re.findall(rf'(({head}){{2,}})', text), which returned the output I was expecting:

AGATC [('AGATCAGATCAGATCAGATC', 'AGATC')]
TTTTTTCT []
AATG []
TCTAG []
GATA []
TATC [('TATCTATCTATCTATCTATC', 'TATC')]
GAAA []
TCTG []
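For anyone wondering why the single-brace version returned nothing, here is a small sketch that prints the pattern strings the two f-strings actually produce. In an f-string, {2,} is evaluated as the Python expression 2, (a one-element tuple), so the regex never sees a quantifier. The variable names broken and fixed, and the shortened text, are just for illustration.

```python
import re

head = "TATC"
text = "GGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAG"

# Single braces: the f-string formats the tuple (2,) into the pattern.
broken = rf'(?:{head}){2,}'
print(broken)  # (?:TATC)(2,)

# Doubled braces are emitted as literal { and }, giving a real quantifier.
fixed = rf'(?:{head}){{2,}}'
print(fixed)   # (?:TATC){2,}

# The broken pattern looks for the literal text "TATC2," and finds nothing.
print(re.findall(broken, text))  # []
print(re.findall(fixed, text))   # ['TATCTATCTATCTATCTATC']
```

Printing the pattern before passing it to re.findall is a quick way to catch this kind of surprise.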

Now I just need to get the total number of times the string occurs in each run, and I should be well on my way.
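One possible way to get that count, sketched under a few assumptions: the helper name longest_run is hypothetical, re.escape is added only as a precaution in case a header ever contains regex metacharacters, and the text is again a short excerpt of the sequence. It divides the length of each matched run by the length of the STR and keeps the largest value.

```python
import re

def longest_run(unit, text):
    """Return the largest number of back-to-back copies of unit found in text."""
    # (?:...)+ matches one or more consecutive copies without capturing,
    # so findall returns each whole run as a plain string.
    runs = re.findall(rf'(?:{re.escape(unit)})+', text)
    # default=0 covers STRs that never appear in the text.
    return max((len(run) // len(unit) for run in runs), default=0)

text = "GGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAG"
for head in ["AGATC", "TATC", "GAAA", "TCTG"]:
    print(head, longest_run(head, text))
# AGATC 4
# TATC 5
# GAAA 1
# TCTG 0
```

Using + rather than {2,} here means a single occurrence still counts as a run of 1, which is usually what you want when comparing against the counts in the CSV. Note that findall scans for non-overlapping matches, so this is a sketch rather than a guaranteed longest-run search for every possible pattern.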

Thank you @Andrej Kesely for your input.
