
Finding every occurrence of a pattern in a string via regex in Python

When using the regex .search() I found that it matches only the first occurrence of a pattern in a string, and that .findall() is needed to find all occurrences of that pattern.
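As a small illustration of that difference, here is a sketch on a made-up structure string (the string and the variable names are only assumptions for the example). It also shows re.finditer, which returns match objects and therefore keeps the positions:

```python
import re

# Hypothetical short dot-bracket string, used only for illustration.
structure = "..(...)..(....).."
pattern = re.compile(r"\(\.+\)")

# search() stops at the first match and returns one match object (or None).
first = pattern.search(structure)

# findall() returns every non-overlapping match, but only as plain strings,
# so the positions are lost.
all_text = pattern.findall(structure)

# finditer() yields one match object per occurrence, so start()/end() are available.
all_spans = [(m.start(), m.end()) for m in pattern.finditer(structure)]

print(first.span())  # (2, 7)
print(all_text)      # ['(...)', '(....)']
print(all_spans)     # [(2, 7), (9, 15)]
```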

So, my question is: given two different strings that "talk" to each other, I need to find every occurrence of a specific pattern in the second string, take the position of each match, and pull the elements at those positions out of the first string, then print them or save them in a new list.

To be clearer, I'll provide an example:

ACGCUGAGAGGACGAUGCGGACGUGCUUAGGACGUUCACACGGUGGAAGUUCACAACAAGCAGACGACUCGCUGAGGAUCCGAGAUUGCUCGCGAUCGG

...((.((....(((..((....(((((.((((.(((((...))))).)))).....)))))..))..))))).))((((((((....)))).))))..

These are the two strings: the first contains letters, the second dots and brackets. The pattern I want to find, compiled as a regex, is "(\(\.+\))". Once the pattern is found in the second string, I want to take its position and return the corresponding elements of the first string. With this input I'd expect two different outputs: CACGG and GAUUGC.

The code I have written so far looks like this:

for line in file:
    if (line[0] == "A") or (line[0] == "C") or (line[0] == "T") or (line[0] == "G"):
        apt.append(line)
        count = count + 1
    else:
        line = line.strip()
        pattern = r"(\(\.+\))"
        match = re.search(pattern, line)
        if match:
            loop.append(apt[count][match.start():match.end()])
        else:
            continue

This obviously retrieves only the first match of the pattern that occurs in the second line of the file, giving only CACGG as output.

How can I modify the code so that it also retrieves the second occurrence of the pattern?

Thank you, any help is appreciated.

If you don't mind using re.finditer:

>>> import re

>>> str1 = "ACGCUGAGAGGACGAUGCGGACGUGCUUAGGACGUUCACACGGUGGAAGUUCACAACAAGCAGACGACUCGCUGAGGAUCCGAGAUUGCUCGCGAUCGG"
>>> str2 = "...((.((....(((..((....(((((.((((.(((((...))))).)))).....)))))..))..))))).))((((((((....)))).)))).."

>>> pat = re.compile(r"\([^()]+\)")

>>> for m in pat.finditer(str2):
...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group()))
...     print(str1[m.start():m.end()])

38-43: (...)
CACGG
83-89: (....)
GAUUGC


The regex \([^()]+\) matches a parenthesized part that has no further parentheses inside it. [^()] is a negated character class: it matches any single character except a parenthesis.

You could also use the pattern \(\.+\), by the way.
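As a quick check, here is a sketch that runs both patterns over the structure string from the question (the variable names are just for this example). On a string made only of dots and parentheses the two should find the same spans; they would differ only if other characters could appear between the brackets:

```python
import re

structure = "...((.((....(((..((....(((((.((((.(((((...))))).)))).....)))))..))..))))).))((((((((....)))).)))).."

# Negated class: anything except parentheses between "(" and ")".
negated = [m.span() for m in re.finditer(r"\([^()]+\)", structure)]

# Dots only: one or more literal dots between "(" and ")".
dots_only = [m.span() for m in re.finditer(r"\(\.+\)", structure)]

print(negated)    # [(38, 43), (83, 89)]
print(dots_only)  # [(38, 43), (83, 89)]
```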


In your case, it could be something like:

if (line[0] == "A") or (line[0] == "C") or (line[0] == "T") or (line[0] == "G"):
    apt.append(line)
    count = count + 1
else:
    line = line.strip()
    # finditer is a method of compiled patterns, so compile the string first
    pattern = re.compile(r"\(\.+\)")
    for match in pattern.finditer(line):
        loop.append(apt[count][match.start():match.end()])

It will be faster if you compile the pattern once, before looping over the file.

I cannot test this code, but keep in mind that each piece found will be appended to loop.
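For reference, a self-contained sketch of the same idea might look like the following. The function name extract_loops, the constant LOOP, and the choice to remember only the most recent sequence line (instead of indexing apt with count) are assumptions made for this example, not part of the original code. It also assumes that each structure line directly follows its sequence line:

```python
import re

# Compiled once, outside the loop over lines.
LOOP = re.compile(r"\(\.+\)")

def extract_loops(lines):
    """Return the sequence slices lying under every '(' + dots + ')' run."""
    loops = []
    sequence = None  # most recent sequence line seen so far
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line[0] in "ACGUT":
            # Sequence line: remember it for the structure line that follows.
            sequence = line
        elif sequence is not None:
            # Structure line: slice the sequence at each match position.
            for m in LOOP.finditer(line):
                loops.append(sequence[m.start():m.end()])
    return loops

sample = [
    "ACGCUGAGAGGACGAUGCGGACGUGCUUAGGACGUUCACACGGUGGAAGUUCACAACAAGCAGACGACUCGCUGAGGAUCCGAGAUUGCUCGCGAUCGG\n",
    "...((.((....(((..((....(((((.((((.(((((...))))).)))).....)))))..))..))))).))((((((((....)))).))))..\n",
]
result = extract_loops(sample)
print(result)  # ['CACGG', 'GAUUGC']
```

Passing an open file object instead of the sample list should work the same way, since the function only iterates over lines.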
