Finding character position from a RegEx'd string

Question

Say, I have 'ABBACDA', and I want to find the index of where each A is.

Using Regex in Python, match = re.findall('A', 'ABBACDA') only returns a list.

Any way I can tweak this? Or should I go an entirely different route? I wanted to avoid using a for x in enumerate(str) because I also would like to check for where, say, 'BB' is and what index does it start at.

Answer 1

Use re.finditer()

example 1:

match = re.finditer('A', 'ABBACDA')
for m in match:
    print m.start(), m.end(), m.group(0)

output:

0 1 A
3 4 A
6 7 A

example 2:

match = re.finditer('BB', 'ABBACDA')
for m in match:
    print m.start(), m.end(), m.group(0)

output :

1 3 BB

Answer 2

To John Machin

« (1) in your "find" code, tof is "A", in your regex code, tof is "BB" »

No. There is a 'find' code with tof = 'A' , and there is a code comparing 'find' method and 'regex' method with tof = 'BB'. There' is not an alone 'regex' code with tof = 'BB'

The 'find' code presents this solution in an isolated manner to be clearer because in the second code its lisibility is obfsucated.

I put the definition of pat = re.compile(tof) out of the loops in order to not count the time of its creation in the measured time. As it is in the beginning of the second code, you believed that it is an entirely 'regex' code. That's not so. It's a comparison code.

You should read more thoroughly.

« (2) ch[prec:].find(tof) instead of ch.find(tof, prec) »

Yes.

In fact I already knew that the index must be put in find() and not used in the string. I found it and I already used that way in past codes, but I just had a blank in my brain.

By the way, the code is hence simpler. And secondly, as you pointed out, the corrected 'find' solution now runs faster. It runs in T seconds when the regex solution runs in 0.77 * T (0.65 * T before correction):

ch = 'jggBBjgjBBBjhgBBBBjjgBBBBBjjggBBBBBBjjjgjBBBBBBB'
tof = 'BB'
L = len(tof)

X,Y = [],[]
pat = re.compile(tof)

for essay in xrange(5):

    te = clock()
    for i in xrange(1000):
        li = []
        x = ch.find(tof)
        while x+1:
            li.append(x)
            x = ch.find(tof,x+L)
    X.append( clock()-te )


    te = clock()
    for i in xrange(1000):
        ly = [ m.start() for m in pat.finditer(ch) ]
    Y.append( clock()-te )

print li,'\n',ly,'\n'
print min(X),'\n',min(Y)

But it's not such a dramatic acceleration though: 18 % faster.

« (3) your find code considers overlapping matches, the regex code doesn't »

I don't think so. I precisely tried with 'ABBACDABBmmrteyBBgfrewBBBBBioBByt BBB ggddbBB BBbGtBBBGtbBbGT' containing 'BBBBB' to verify that the 'find' solution gives the same result as the 'regex' solution, because I do know that a regex doesn't return overlapping matches. I used L = len(tof) precisely to avoid detection of overlapping slices.

result of the above code:

[3, 8, 14, 16, 21, 23, 30, 32, 34, 41, 43, 45]
[3, 8, 14, 16, 21, 23, 30, 32, 34, 41, 43, 45]

So what does «Your find code considers overlapping matches» mean, please ?

« (4) etc etc »

That's unfair. If there are etceterae, tell which ones. If none... well, I don't know...

I think that a ratio of 1 right comment versus a false one, a dubious one and an unfair one isn't enough to downvote.

Moreover, it appears to me that on SO the imperfect answers written 1 or 2 hours after the question don't escape to downvotes, while the good similar ones are more rarely upvoted because the questionner have had his solution for a long time.

Answer 3

If you are looking for only one result, use the re.search function instead. That returns a MatchObject (http://docs.python.org/library/re.html#re.MatchObject). The object has start() and end() functions.

Answer 4

You could try timing the following to see if it is faster than the regex approach:

def multifind(needle, haystack, overlap=False):
    delta = 1 if overlap else max(1, len(needle))
    pos = 0
    find = haystack.find
    while 1:
        pos = find(needle, pos)
        if pos < 0: return
        yield pos
        pos += delta

>>> data = 'ABBACDABBmmrteyBBgfrewBBBBBioBBytA BBB ggdAdbBB BBbGtBBABGtbBbGT'
>>> list(multifind('A', data))
[0, 3, 6, 33, 42, 55]
>>> list(multifind('BB', data))
[1, 7, 15, 22, 24, 29, 35, 45, 48, 53]
>>> list(multifind('BB', data, overlap=True))
[1, 7, 15, 22, 23, 24, 25, 29, 35, 36, 45, 48, 53]
>>> list(multifind('', 'qwerty'))
[0, 1, 2, 3, 4, 5, 6]
>>>

Answer 5

Oh, I also knew the use of a generator, and I had forgotten its advantage in such a research. It's a fine remark, this one.

And the use of an alias hayfind = haystack.find to make the code faster is very tricky too.

Now the 'find with generator' method runs in T and 'regex' method runs in only 0.96 * T

ch = 'jggBBjgjBBBjhgBBBBjjgBBBBBjjggBBBBBBjjjgjBBBBBBB'
tof = 'BB'
L = len(tof)

X,Y = [],[]
pat = re.compile(tof)

def multifind(needle, haystack):
    delta = len(needle)
    pos = 0
    hayfind = haystack.find
    while 1:
        pos = hayfind(needle, pos)
        if pos < 0:  return
        yield pos
        pos += delta

for essay in xrange(20):

    te = clock()
    for i in xrange(10000):
        li = list(multifind(tof,ch))
    X.append( clock()-te )


    te = clock()
    for i in xrange(10000):
        ly = [ m.start() for m in pat.finditer(ch) ]
    Y.append( clock()-te )

print li,'\n',ly,'\n'
print min(X),'\n',min(Y)

The two methods are of equivalent speeds. But I finaly prefer the 'regex' solution because it can be modified to obtain more results if needed (start and end, other conditions...) while 'find' solution only give the positions.

Answer 6

I likes regexes but I searched another solution :

ch = 'ABBACDABBmmrteyBBgfrewBBBBBioBBytA BBB ggdAdbBB BBbGtBBABGtbBbGT'
tof = 'A'

li, prec = [], 0
x = ch[prec:].find(tof)

while x+1:
    li.append(prec+x)
    prec += x+1
    x = ch[prec:].find(tof)

print li

result

[0, 3, 6, 33, 42, 55]

I measured the speeds and I am astonished that the solution with regex runs in 0,65 * T when solution with find() runs in T seconds:

ch = 'ABBACDABBmmrteyBBgfrewBBBBBioBByt BBB ggddbBB BBbGtBBBGtbBbGT'

tof = 'BB'

X,Y = [],[]
pat = re.compile(tof)

for essay in xrange(50):

    te = clock()
    for i in xrange(1000):
        li, prec, L = [], 0, len(tof)
        x = ch[prec:].find(tof)
        while x+1:
            li.append(prec+x)
            prec += x+L
            x = ch[prec:].find(tof)
    X.append( clock()-te )

    te = clock()
    for i in xrange(1000):
        ly = [ m.start() for m in pat.finditer(ch) ]
    Y.append( clock()-te )

print li
print ly
print
print min(X)
print min(Y)

Finding character position from a RegEx'd string

Question

6 answers

solution1
5 ACCPTED 2011-01-20 00:31:32

solution2
2 2011-01-20 03:55:33

solution3
0 2011-01-20 00:34:12

solution4
0 2011-01-20 02:26:11

solution5
0 2011-01-20 04:42:04

solution6
-1 2011-01-20 01:51:18

Finding character position from a RegEx'd string

Question

6 answers

solution1 5 ACCPTED 2011-01-20 00:31:32

solution2 2 2011-01-20 03:55:33

solution3 0 2011-01-20 00:34:12

solution4 0 2011-01-20 02:26:11

solution5 0 2011-01-20 04:42:04

solution6 -1 2011-01-20 01:51:18

solution1
5 ACCPTED 2011-01-20 00:31:32

solution2
2 2011-01-20 03:55:33

solution3
0 2011-01-20 00:34:12

solution4
0 2011-01-20 02:26:11

solution5
0 2011-01-20 04:42:04

solution6
-1 2011-01-20 01:51:18