
Adding up re.finditer results

Is there any way to add up the results from different finditer calls, like you could do with findall? For instance:

matches = re.finditer(pattern_1, text) + re.finditer(pattern_2, text)

I have several different patterns and results and I'd like to iterate over them as a single block instead of separately.

You can use itertools.chain.from_iterable.

import re
from itertools import chain

patterns = []  # All patterns go here

for match in chain.from_iterable(re.finditer(pattern, text) for pattern in patterns):
    print(match)

This yields

<_sre.SRE_Match object at 0x021544B8>
<_sre.SRE_Match object at 0x021544F0>
<_sre.SRE_Match object at 0x021544B8>
<_sre.SRE_Match object at 0x021544F0>
<_sre.SRE_Match object at 0x021544B8>

This output was produced with the input from eyquem's answer.
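
For readers on Python 3, here is a self-contained sketch of the same idea. The sample text and the two patterns are made up for illustration and do not come from the question.

```python
import re
from itertools import chain

# Hypothetical sample input and patterns, chosen only for illustration.
text = "alpha 12 beta 345 gamma"
patterns = [r"\d+", r"[a-z]+a\b"]

# chain.from_iterable flattens the per-pattern iterators
# into a single stream of Match objects.
matches = list(chain.from_iterable(re.finditer(p, text) for p in patterns))

for m in matches:
    print(m.group(), m.span())
```

The matches are collected into a list here only so they can be looped over more than once; iterating over the chain object directly works just as well.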

You can use itertools.chain:

import itertools
import re

for match in itertools.chain(re.finditer(pattern_1, text), re.finditer(pattern_2, text)):
    pass

There is a function chain in the itertools module.
It makes an iterator that returns all the elements of the first iterable, then of the next, and so on to the last.

import itertools
import re
matches = itertools.chain(re.finditer(pattern_1, text), re.finditer(pattern_2, text))
for m in matches:
    pass

Generally itertools is a gift.
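
One thing to keep in mind is that chain yields every match of the first pattern before any match of the second. If the matches are needed in the order they occur in the text, sorting on the start offset is one option. A small sketch in Python 3 syntax, with made-up input and patterns:

```python
import re
from itertools import chain

# Hypothetical sample input and patterns, chosen only for illustration.
text = "x1 y2 x3"
pattern_1 = r"x\d"
pattern_2 = r"y\d"

chained = chain(re.finditer(pattern_1, text), re.finditer(pattern_2, text))

# Sorting by start offset interleaves the matches
# in the order they occur in the text.
in_text_order = sorted(chained, key=lambda m: m.start())

print([m.group() for m in in_text_order])
```

Note that sorted() consumes the whole iterator at once, so this gives up the laziness of chain.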

Isn't a generator function what you need, rather than chaining re.finditer(pat, text) iterators?
With chaining, you have to write out one re.finditer call per pattern, and do it all over again for each new group of patterns.

Here's my way:

import re

pat1 = r'<tag>.+?</tag>'
pat2 = r'mount.+?@'
pat3 = r'\d{3} [a-r]+'

text = 'The amount <tag>of 100 dollars was</tag> given @ me'\
       '<tag>the_mountain_and_desert : john@gmail.com'\
       'Sun is mounting @ the top of the sky'


def gmwp(text, patterns):  # generates matching substrings for each pattern in turn
    for pat in patterns:
        for m in re.finditer(pat, text):
            yield m.group()

ps = (pat1, pat2, pat3)

for x in gmwp(text, ps):
    print(x)

result

<tag>of 100 dollars was</tag>
mount <tag>of 100 dollars was</tag> given @
mountain_and_desert : john@
mounting @
100 dollar

It is the presence of the keyword yield that defines the function as a generator function.
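
To illustrate that point, here is a small sketch in Python 3 syntax. The input string and patterns are made up; inspect.isgeneratorfunction is used only to show that Python treats the function differently because of yield.

```python
import inspect
import re


def gmwp(text, patterns):
    """Yield the matched substring for every pattern, one pattern after another."""
    for pat in patterns:
        for m in re.finditer(pat, text):
            yield m.group()


# Hypothetical input, for illustration only.
gen = gmwp("a1 b22 c333", [r"\d+", r"[a-c]"])

# True because the function body contains `yield`.
is_generator_function = inspect.isgeneratorfunction(gmwp)

first = next(gen)  # the function body only starts running here
rest = list(gen)   # drains the remaining substrings
```

Calling gmwp() does not run any regex search by itself; the work happens as values are requested from the generator.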


Edit

Following Steinar Lima's comment, I've looked at this problem again.

In a sense, Steinar Lima is right: I wrote a function that acts somewhat like the chaining performed by itertools.chain(). But that isn't quite true.

In fact, my generator function doesn't yield matches as the other chain()-based solutions do; it yields the matching substrings, because it seemed to me that someone who uses regexes wants the matching substrings in a text, not Match objects. So this function does for matching substrings what chain() does for matches in the other solutions. I didn't manage to do it with chain(), and I don't think chain() alone can produce what my generator function yields, so I don't fully agree that I wrote another implementation of chain(): for my particular goal, using chain() isn't practicable. Show me if you can.
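
For comparison, one possible way to get substrings out of chain() is to wrap it in a comprehension or generator expression that calls group() on each match. This is only a sketch on made-up input, in Python 3 syntax; whether it counts as "using chain()" for the goal described above is a matter of taste.

```python
import re
from itertools import chain

# Hypothetical sample input and patterns, for illustration only.
text = "a1 b22 c333"
patterns = (r"\d+", r"[a-c]")

# Flatten the per-pattern iterators with chain.from_iterable,
# then map each Match object to its matched substring.
substrings = [m.group()
              for m in chain.from_iterable(re.finditer(p, text) for p in patterns)]

print(substrings)
```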

Now, if the aim is to find a way to produce matches from a collection of patterns:

  • the code written by Simeon Visser and Deck doesn't please me because it requires writing re.finditer(pattern_1, text), re.finditer(pattern_2, text), etc. for each regex pattern, instead of working from a collection of patterns.

  • the code of Steinar Lima uses such a collection, but it doesn't please me either because it returns iterators of matches, not matches.

  • after putting my ideas in order, I found what I consider the really convenient solution:
    my second code uses chain() and the collection patterns to yield Match objects.


import re
from pprint import pprint

text = 'The amount <tag>of 100 dollars was</tag> given @ me'\
       '<tag>the_mountain_and_desert : john@gmail.com'\
       'Sun is mounting @ the top of the sky'

pattern_1 = r'<tag>.+?</tag>'
pattern_2 = r'mount.+?@'
pattern_3 = r'\d{3} [a-r]+'

print('Code of #Simeon Visser and Deck=========')
import itertools
for match in itertools.chain(re.finditer(pattern_1, text),
                             re.finditer(pattern_2, text),
                             re.finditer(pattern_3, text)):
    print(match)  # a Match object
    #pprint(list(match))

print('\nCode of #Steinar Lima ==================')
from itertools import chain
patterns = [pattern_1, pattern_2, pattern_3]  # All patterns go here
# chain() receives a single generator here, so it yields the iterators themselves
for match in chain(re.finditer(pattern, text) for pattern in patterns):
    print('#  {}'.format(match))  # a re.finditer(...) iterator object
    pprint(list(match))


print('\nCode 2 of #eyquem ======================')
for match in chain(*(re.finditer(pattern, text)
                     for pattern in patterns)):
    print(match)  # a Match object

result

Code of #Simeon Visser and Deck=========
<_sre.SRE_Match object at 0x011DB800>
<_sre.SRE_Match object at 0x011DB838>
<_sre.SRE_Match object at 0x011DB800>
<_sre.SRE_Match object at 0x011DB838>
<_sre.SRE_Match object at 0x011DB800>

Code of #Steinar Lima ==================
#  <callable-iterator object at 0x011E0B10>
[<_sre.SRE_Match object at 0x011DB800>]
#  <callable-iterator object at 0x011E0A90>
[<_sre.SRE_Match object at 0x011DB800>,
 <_sre.SRE_Match object at 0x011DB838>,
 <_sre.SRE_Match object at 0x011DB870>]
#  <callable-iterator object at 0x011E0B10>
[<_sre.SRE_Match object at 0x011DB800>]

Code 2 of #eyquem ======================
<_sre.SRE_Match object at 0x011DB800>
<_sre.SRE_Match object at 0x011DB838>
<_sre.SRE_Match object at 0x011DB800>
<_sre.SRE_Match object at 0x011DB838>
<_sre.SRE_Match object at 0x011DB800>


EDIT 2

So, after its modification, the code of Steinar Lima produces the same result as my own second code.

I use chain(*(........)) while he uses chain.from_iterable(.........).

I wondered whether there is any difference that would justify preferring one of these two ways.

The following code compares the execution times.

# Reuses re, chain, text and patterns from the previous listing.
from timeit import default_timer as clock  # time.clock() was removed in Python 3.8

n = 5000

print('\nCode 2 of #eyquem ======================')
te = clock()
for i in range(n):
    for match in chain(*(re.finditer(pattern, text)
                         for pattern in patterns)):
        del match
t1 = clock()-te
print(t1)


print('\nCode 2 of #Steinar Lima ================')
te = clock()
for i in range(n):
    for match in chain.from_iterable(re.finditer(pattern, text)
                                     for pattern in patterns):
        del match
t2 = clock()-te
print(t2)

print('\ntLima/teyquem == {:.2%}'.format(t2/t1))

  • Which of the two is faster seems to depend on the value of n; in any case, the times don't differ much from one to the other.

  • My way does use fewer characters than chain.from_iterable, but that isn't decisive.

  • Another point is that, personally, I find the form chain(*(........)) easier to understand: it expresses instantly that the operation takes each sequence in (.........) and chains all their elements, one after the other.
    chain.from_iterable(.........), on the other hand, gives me the impression that it's the sequences in (..........) that are yielded one after the other, not their elements.
    That's subjective.

  • I found only one case in which chain.from_iterable has a specific advantage: when someone wants to perform the operation on several sequences of sequences that are themselves held in a collection.
    The following code shows what I mean.


from pprint import pprint
from itertools import chain

li = [(1, 12, 85), 'king', ('a', 'bnj')]
hu = (['AB', pprint], (145, 854))
ss = 'kim'

collek = (li, hu, ss)
print('collek :')
pprint(collek)

print('')

# chain.from_iterable is passed to map as an ordinary one-argument function
for x in map(chain.from_iterable, collek):
    print(list(x))

print('')

for y in collek:
    print(list(chain(*y)))

result

collek :
([(1, 12, 85), 'king', ('a', 'bnj')],
 (['AB', <function pprint at 0x011DDFB0>], (145, 854)),
 'kim')

[1, 12, 85, 'k', 'i', 'n', 'g', 'a', 'bnj']
['AB', <function pprint at 0x011DDFB0>, 145, 854]
['k', 'i', 'm']

[1, 12, 85, 'k', 'i', 'n', 'g', 'a', 'bnj']
['AB', <function pprint at 0x011DDFB0>, 145, 854]
['k', 'i', 'm']

The first iteration yields objects that are already chaining iterators, while in the second iteration the objects yielded are the elements of the collection, and the chaining must be applied afterwards.
The second iteration can also be written:

for y in collek:
    print(list(chain.from_iterable(y)))

but the first one cannot be written as directly with chain(*...), because it relies on passing chain.from_iterable itself to map; the starred form would need a lambda or a helper function.
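
One further difference, not measured above, is laziness: star-unpacking has to run the generator expression to completion before chain() is even called, while chain.from_iterable pulls the inner iterables one at a time. The sketch below, in Python 3 syntax, uses a hypothetical helper make_sources that records when each inner iterable is created.

```python
from itertools import chain


def make_sources(log):
    """Hypothetical helper: yields small iterators and logs when each is created."""
    for name in ("a", "b", "c"):
        log.append(name)
        yield iter(name)


# chain(*...) unpacks the generator up front, so all sources exist
# before the first element is requested.
eager_log = []
eager = chain(*make_sources(eager_log))
eager_created_before_first_item = list(eager_log)

# chain.from_iterable only creates a source when it is needed.
lazy_log = []
lazy = chain.from_iterable(make_sources(lazy_log))
next(lazy)
lazy_created_after_first_item = list(lazy_log)

print(eager_created_before_first_item, lazy_created_after_first_item)
```

With a handful of regex patterns this hardly matters, but it can when the collection of iterables is large or expensive to produce.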


 