Python: re.find longest sequence

Question

I have a string that is randomly generated:

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"

I'd like to find the longest sequence of "diNCO diol" and the longest of "diNCO diamine". So in the case above the longest "diNCO diol" sequence is 1 and the longest "diNCO diamine" is 3.

How would I go about doing this using Python's re module?

Thanks in advance.

EDIT:
I mean the largest number of consecutive repeats of a given string. So the longest run of "diNCO diamine" is 3:
diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine

Answer 1

Expanding on Ealdwulf's answer:

Documentation on re.findall can be found in the Python re module reference.

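import re

def getLongestSequenceSize(search_str, polymer_str):
    # Each match is one contiguous run of search_str repeats.
    matches = re.findall(r'(?:\b%s\b\s?)+' % re.escape(search_str), polymer_str)
    # Compare runs by length; a bare max() would compare them alphabetically.
    longest_match = max(matches, key=len, default='')
    return longest_match.count(search_str)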

This could be written as one line, but it becomes less readable in that form.
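For reference, a one-line form might look like the sketch below. The name longest_run_one_liner is hypothetical. The key=len and default='' arguments (Python 3.4+) and the re.escape call are assumptions added here, so that runs are compared by length and a string with no match gives 0.

```python
import re

def longest_run_one_liner(search_str, polymer_str):
    # findall collects every contiguous run, max(..., key=len) picks the
    # longest one, and count() reports how many repeats it contains.
    return max(re.findall(r'(?:\b%s\b\s?)+' % re.escape(search_str), polymer_str),
               key=len, default='').count(search_str)

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
print(longest_run_one_liner("diNCO diamine", polymer_str))  # 3
print(longest_run_one_liner("diNCO diol", polymer_str))     # 1
```

The nesting of three calls in one expression is what costs readability compared with the named intermediate variables.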

Alternative:

If polymer_str is huge, it will be more memory efficient to use re.finditer. Here's how you might go about it:

def getLongestSequenceSize(search_str, polymer_str):
    longest_match = ''
    # finditer yields Match objects one at a time instead of building a list.
    for match in re.finditer(r'(?:\b%s\b\s?)+' % re.escape(search_str), polymer_str):
        # Keep only the longest run seen so far.
        if len(match.group(0)) > len(longest_match):
            longest_match = match.group(0)
    return longest_match.count(search_str)

The biggest difference between findall and finditer is that the first returns a list of all matches at once, while the second returns an iterator that yields Match objects one at a time. The finditer approach may also be somewhat slower.
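The difference in return types can be seen in a small sketch like the one below. The pattern is hard-coded for "diNCO diamine", and the variable names runs and spans are illustrative only.

```python
import re

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
pattern = r'(?:\bdiNCO diamine\b\s?)+'

# findall builds the complete list of matched strings up front.
runs = re.findall(pattern, polymer_str)
print(runs)

# finditer yields one Match object at a time; each Match also carries
# the position of the run inside the string.
spans = [(m.start(), m.end()) for m in re.finditer(pattern, polymer_str)]
print(spans)
```

Both calls find the same two runs here. Only the form of the result differs: plain strings from findall, Match objects with positions from finditer.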

Answer 2

I think the OP wants the longest contiguous sequence. You can get all contiguous sequences like this:

seqs = re.findall("(?:diNCO diamine ?)+", polymer_str)

and then find the longest.
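A minimal sketch of the "find the longest" step, assuming the units are separated by single spaces. The pattern below allows an optional space after each unit so that space-separated repeats form one run. The names unit, longest and repeats are illustrative.

```python
import re

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
unit = "diNCO diamine"

# " ?" lets consecutive units, separated by one space, join into a single run.
seqs = re.findall("(?:%s ?)+" % re.escape(unit), polymer_str)

# Pick the longest run by character length, then count the units inside it.
longest = max(seqs, key=len, default="")
repeats = longest.count(unit)
print(repeats)  # 3
```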

Answer 3

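import re
pat = re.compile("[^|]+")
# Replace each "diNCO diamine" unit with "|" and drop spaces, so a run of n repeats becomes n pipes.
p = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine".replace("diNCO diamine","|").replace(" ","")
# Splitting on the non-pipe text leaves only the pipe runs; the longest one has one "|" per repeat.
print(max(map(len, pat.split(p))))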

Answer 4

One way is to use findall:

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
len(re.findall("diNCO diamine", polymer_str)) # returns 4, the total number of occurrences (not the longest run)

Answer 5

Using re:

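 # re.search returns only the first run, which happens to be the longest one in this string.
 m = re.search(r"(\bdiNCO diamine\b\s?)+", polymer_str)
 # Count the repeated units in that run.
 m.group(0).count("diNCO diamine")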

5 answers

solution1
9 ACCEPTED 2009-07-20 20:31:51

solution2
3 2009-07-20 19:37:33

solution3
3 2009-07-21 00:25:54

solution4
0 2009-07-20 19:25:40

solution5
0
