Adding up re.finditer results

Is there any way to add up the results from different finditer calls, as you can with findall? For instance:

matches = re.finditer(pattern_1, text) + re.finditer(pattern_2, text)

I have several different patterns and results, and I'd like to iterate over them as a single block instead of separately.

You can use itertools.chain.from_iterable:

import re
from itertools import chain

patterns = []  # All patterns go here

for match in chain.from_iterable(re.finditer(pattern, text) for pattern in patterns):
    print match

This yields:

<_sre.SRE_Match object at 0x021544B8>
<_sre.SRE_Match object at 0x021544F0>
<_sre.SRE_Match object at 0x021544B8>
<_sre.SRE_Match object at 0x021544F0>
<_sre.SRE_Match object at 0x021544B8>

This uses the input from eyquem's answer.

You can use itertools.chain:

import itertools

for match in itertools.chain(re.finditer(pattern_1, text), re.finditer(pattern_2, text)):
    pass

The itertools module provides a function called chain.
With it you can make an iterator that returns all the elements of several iterable objects, from the first to the last.

import itertools
import re

matches = itertools.chain(re.finditer(pattern_1, text), re.finditer(pattern_2, text))
for m in matches:
    pass

Generally, itertools is a gift.
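As a minimal sketch of that behaviour (written for Python 3; the text and the two patterns here are made up for illustration and are not from the question): chain() consumes its arguments in order, so every match of the first pattern comes before any match of the second.

```python
import itertools
import re

# Hypothetical sample data, not taken from the question.
text = "a1 b22 c333"
pattern_1 = r"[a-z]"   # single lowercase letters
pattern_2 = r"\d+"     # runs of digits

matches = itertools.chain(re.finditer(pattern_1, text),
                          re.finditer(pattern_2, text))

# Each item is a Match object; group() gives the matched substring
# and start() its position in the text.
found = [(m.group(), m.start()) for m in matches]
print(found)
```

Note that the results are grouped by pattern, not by position in the text; if document order matters, the Match objects can be sorted on m.start().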

Isn't a generator function what you need,
rather than chaining re.finditer(pat, text) generators that you have to write out one after the other, with a different pattern for each, and then doing the same thing again for each new group of patterns?

Here's my way: 这是我的方式:

import re

pat1 = r'<tag>.+?</tag>'
pat2 = r'mount.+?@'
pat3 = r'\d{3} [a-r]+'  # raw string so that \d reaches the regex engine unchanged

text = 'The amount <tag>of 100 dollars was</tag> given @ me'\
       '<tag>the_mountain_and_desert : john@gmail.com'\
       'Sun is mounting @ the top of the sky'


def gmwp(text, patterns):
    # Generator: yields the matching substrings, pattern by pattern
    for pat in patterns:
        for m in re.finditer(pat, text):
            yield m.group()

ps = (pat1,pat2,pat3)

for x in gmwp(text,ps):
    print x

Result:

<tag>of 100 dollars was</tag>
mount <tag>of 100 dollars was</tag> given @
mountain_and_desert : john@
mounting @
100 dollar

It is the presence of the keyword yield that makes the function a generator function.
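A short illustration of that point, written for Python 3 with a made-up function (squares is a hypothetical name, not part of the answer): calling a generator function does not run its body. It returns a generator object, and the body executes step by step as values are requested.

```python
import inspect

def squares(numbers):
    # The 'yield' below is what makes this a generator function.
    for n in numbers:
        yield n * n

gen = squares([1, 2, 3])   # nothing in the body has run yet

is_gen_func = inspect.isgeneratorfunction(squares)
is_gen = inspect.isgenerator(gen)
first = next(gen)          # runs the body up to the first yield
rest = list(gen)           # consumes the remaining values
print(is_gen_func, is_gen, first, rest)
```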


Edit

Following Steinar Lima's comment, I've examined this problem again.

In a sense, Steinar Lima is right: I wrote a function that acts somewhat like the chaining performed by itertools.chain(). But that isn't quite true.

In fact, my generator function doesn't yield matches, as the other solutions based on chain() do; it yields the matching substrings of the text, because it seemed to me that someone who uses regexes wants to find such matching substrings, not Match objects. So this function does for matching substrings what chain() does for matches in the other solutions. But since I didn't succeed in doing it with chain(), and I don't think it's possible to reproduce the output of my generator function with chain() alone, I don't fully agree with his opinion that I wrote another implementation of chain(): for my particular goal, the use of chain() isn't practicable. Show me if you can.
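For what it's worth, the matching substrings can also be obtained with chain by turning each Match into its group() afterwards. The following is only a sketch for Python 3, reusing the text and patterns from the code above; the names substrings and result are assumptions of this example.

```python
import re
from itertools import chain

text = ('The amount <tag>of 100 dollars was</tag> given @ me'
        '<tag>the_mountain_and_desert : john@gmail.com'
        'Sun is mounting @ the top of the sky')
patterns = (r'<tag>.+?</tag>', r'mount.+?@', r'\d{3} [a-r]+')

# chain.from_iterable flattens the per-pattern iterators of Match objects;
# the outer generator expression turns each Match into its substring.
substrings = (m.group()
              for m in chain.from_iterable(re.finditer(p, text)
                                           for p in patterns))

result = list(substrings)
for s in result:
    print(s)
```

Whether this reads better than a dedicated generator function is a matter of taste; the two approaches yield the same substrings in the same order.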

Now, if the aim is to find a way to produce matches from a collection of patterns:

  • the code written by Simeon Visser and Deck doesn't please me because it requires writing re.finditer(pattern_1, text), re.finditer(pattern_2, text), etc. for each regex pattern, instead of working on a collection of them.

  • the code of Steinar Lima uses such a collection, but it doesn't please me either, because it yields iterators of matches, not matches.

  • after having put my ideas in order, I found what I consider the really convenient solution:
    my second code uses chain() and the collection patterns to yield Match objects.


import re
from pprint import pprint

text = 'The amount <tag>of 100 dollars was</tag> given @ me'\
       '<tag>the_mountain_and_desert : john@gmail.com'\
       'Sun is mounting @ the top of the sky'

pattern_1 = r'<tag>.+?</tag>'
pattern_2 = r'mount.+?@'
pattern_3 = r'\d{3} [a-r]+'  # raw string so that \d reaches the regex engine unchanged

print 'Code of #Simeon Visser and Deck========='
import itertools
for match in itertools.chain(re.finditer(pattern_1, text),
                             re.finditer(pattern_2, text),
                             re.finditer(pattern_3, text)):
    print match # a Match object
    #pprint(list(match))

print '\nCode of #Steinar Lima =================='
from itertools import chain
patterns = [pattern_1,pattern_2,pattern_3] # All patterns go here
for match in chain(re.finditer(pattern, text) for pattern in patterns):
    print '# ',match # a re.finditer(...) iterator object
    pprint(list(match))


print '\nCode 2 of #eyquem ======================'
for match in chain(*(re.finditer(pattern, text)
                     for pattern in patterns)):
    print match # a Match object

Result:

Code of #Simeon Visser and Deck=========
<_sre.SRE_Match object at 0x011DB800>
<_sre.SRE_Match object at 0x011DB838>
<_sre.SRE_Match object at 0x011DB800>
<_sre.SRE_Match object at 0x011DB838>
<_sre.SRE_Match object at 0x011DB800>

Code of #Steinar Lima ==================
#  <callable-iterator object at 0x011E0B10>
[<_sre.SRE_Match object at 0x011DB800>]
#  <callable-iterator object at 0x011E0A90>
[<_sre.SRE_Match object at 0x011DB800>,
 <_sre.SRE_Match object at 0x011DB838>,
 <_sre.SRE_Match object at 0x011DB870>]
#  <callable-iterator object at 0x011E0B10>
[<_sre.SRE_Match object at 0x011DB800>]

Code 2 of #eyquem ======================
<_sre.SRE_Match object at 0x011DB800>
<_sre.SRE_Match object at 0x011DB838>
<_sre.SRE_Match object at 0x011DB800>
<_sre.SRE_Match object at 0x011DB838>
<_sre.SRE_Match object at 0x011DB800>


EDIT 2

So, after its modification, the code of Steinar Lima produces the same result as my own second code.

I use chain(*(........)) while he uses chain.from_iterable(.........).

I wondered whether there is any difference that would justify preferring one of these two ways.

The following code compares the execution times.

from time import clock

n = 5000

print '\nCode 2 of #eyquem ======================'
te = clock()
for i in xrange(n):
    for match in chain(*(re.finditer(pattern, text)
                         for pattern in patterns)):
        del match
t1 = clock()-te
print t1


print '\nCode 2 of #Steinar Lima ================'
te = clock()
for i in xrange(n):
    for match in chain.from_iterable(re.finditer(pattern, text)
                                     for pattern in patterns):
        del match
t2 = clock()-te
print t2

print '\ntLima/teyquem == {:.2%}'.format(t2/t1)

  • It seems that which of the two codes is faster depends on the value of n; in any case, the times don't differ much from one code to the other.

  • The fact remains that my way uses fewer characters than chain.from_iterable, but that isn't decisive.

  • Another point is that, personally, I understand the form chain(*(........)) more easily: it expresses at once that the operation takes each of the sequences in (.........) and chains all their elements, one after the other.
    chain.from_iterable(.........), on the other hand, gives me the impression that it's the sequences in (..........) that are yielded one after the other, not their elements.
    That's subjective.

  • I found only one case in which chain.from_iterable presents a specific advantage: when someone wishes to perform the operation on several sequences of sequences that are themselves held in a collection.
    The following code shows what I mean:


from pprint import pprint
from itertools import chain

li = [(1,12,85),'king',('a','bnj')]
hu = (['AB',pprint],(145,854))
ss = 'kim'

collek = (li,hu,ss)
print 'collek :'
pprint(collek)

print

for x in  map(chain.from_iterable,collek):
    print list(x),x

print

for y in collek:
    print list(chain(*y))

Result:

collek :
([(1, 12, 85), 'king', ('a', 'bnj')],
 (['AB', <function pprint at 0x011DDFB0>], (145, 854)),
 'kim')

[1, 12, 85, 'k', 'i', 'n', 'g', 'a', 'bnj']
['AB', <function pprint at 0x011DDFB0>, 145, 854]
['k', 'i', 'm']

[1, 12, 85, 'k', 'i', 'n', 'g', 'a', 'bnj']
['AB', <function pprint at 0x011DDFB0>, 145, 854]
['k', 'i', 'm']

The first iteration yields objects that are directly iterators, while in the second iteration the objects yielded are the elements of the collection, and the chaining must be applied afterwards.
The second iteration can be written:

for y in collek:
    print list(chain.from_iterable(y))

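but the first one cannot be written as directly with chain(*...): map needs a callable that takes a single argument, so chain(*...) would have to be wrapped in a lambda or a generator expression.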
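One further difference between the two forms, not raised in the answers above, is laziness: chain(*iterable) has to unpack the whole outer iterable before chain is even called, whereas chain.from_iterable(iterable) pulls the inner iterables one at a time. The sketch below (Python 3; numbered and calls are hypothetical names introduced for this illustration) records when each inner iterable is produced.

```python
from itertools import chain, islice

calls = []

def numbered(n):
    # Hypothetical helper: yields n small lists and logs each request.
    for i in range(n):
        calls.append(i)
        yield [i, i]

# chain(*...) exhausts the outer generator up front.
eager = chain(*numbered(3))
after_eager = list(calls)      # every inner list has already been produced

calls.clear()

# chain.from_iterable only asks for an inner list when it needs one.
lazy = chain.from_iterable(numbered(3))
before_lazy = list(calls)      # nothing requested yet
first_two = list(islice(lazy, 2))
after_lazy = list(calls)       # only the first inner list was requested

print(after_eager, before_lazy, first_two, after_lazy)
```

In practice this means chain.from_iterable can also be used with a very long or unbounded outer iterable, where chain(*...) could not.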
