
Why use regex finditer() rather than findall()

What is the advantage of using finditer() if findall() is good enough? findall() returns all of the matches as a list, while finditer() returns match objects, which can't be processed as directly as a static list.

For example:

import re

CARRIS_REGEX = (r'<th>(\d+)</th><th>([\s\w\.\-]+)</th>'
                r'<th>(\d+:\d+)</th><th>(\d+m)</th>')
pattern = re.compile(CARRIS_REGEX, re.UNICODE)
with open("test.txt") as f:
    mailbody = f.read()

# finditer() yields one match object per match
for match in pattern.finditer(mailbody):
    print(match)
print()
# findall() returns a list of tuples, one tuple of groups per match
for match in pattern.findall(mailbody):
    print(match)

Output:

<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>

('790', 'PR. REAL', '21:06', '04m')
('758', 'PORTAS BENFICA', '21:10', '09m')
('790', 'PR. REAL', '21:14', '13m')
('758', 'PORTAS BENFICA', '21:21', '19m')
('790', 'PR. REAL', '21:29', '28m')
('758', 'PORTAS BENFICA', '21:38', '36m')
('758', 'SETE RIOS', '21:49', '47m')
('758', 'SETE RIOS', '22:09', '68m')

I ask this out of curiosity.
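On the point that match objects cannot be handled as directly as a list: the sketch below suggests that each match object can give back the same tuple findall() produces, plus details findall() drops, such as the match position. The one-row input string is a made-up stand-in for the contents of test.txt, not the real file.

```python
import re

# Hypothetical single row in the same shape as the question's data.
html = "<th>790</th><th>PR. REAL</th><th>21:06</th><th>04m</th>"
pattern = re.compile(r"<th>(\d+)</th><th>([\s\w.\-]+)</th>"
                     r"<th>(\d+:\d+)</th><th>(\d+m)</th>")

rows = []
spans = []
for match in pattern.finditer(html):
    rows.append(match.groups())  # the tuple findall() would have returned
    spans.append(match.span())   # (start, end) offsets, not available from findall()

print(rows)
print(spans)
```

So list(m.groups() for m in pattern.finditer(text)) gives the findall() result whenever the pattern has two or more groups, as it does here.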

finditer() returns an iterator, while findall() returns a list. An iterator only does work when you ask it for the next item by calling .next() on it (spelled next(it) in Python 3). A for loop makes that call for you on each pass, so if you break out of the loop early, the remaining matches are never searched for. A list, on the other hand, has to be fully populated, which means every match must be found up front.
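A minimal sketch of that difference, using a made-up pattern and input string rather than the data from the question:

```python
import re

pattern = re.compile(r"\d+")
text = "10 20 30 40 50"  # hypothetical input

# findall() scans the whole string and builds the complete list up front.
all_matches = pattern.findall(text)

# finditer() produces one match object per step; after the break,
# no further searching is done on the rest of the string.
first_two = []
for match in pattern.finditer(text):
    first_two.append(match.group())
    if len(first_two) == 2:
        break

# next() pulls a single match on demand.
first = next(pattern.finditer(text))

print(all_matches)
print(first_two)
print(first.group())
```

If only the first hit matters, pattern.search(text) does the same job as the next() call without creating an iterator at all.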

Iterators can be far more memory- and CPU-efficient, since they only need to hold one item at a time. If you were matching against a very large string (an encyclopedia can run to several hundred megabytes of text), trying to find all matches at once could make the program hang while it searches and potentially run out of memory.
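As a rough illustration of the memory point, the sketch below counts matches without ever holding them all at once. The repeated string is only a hypothetical stand-in for a large document, and the text itself still has to fit in memory either way; only the list of results is avoided.

```python
import re

pattern = re.compile(r"\w+")
big_text = "lorem ipsum " * 100_000  # hypothetical stand-in for a large document

# Consumes matches one at a time; no list of results is ever built.
count_lazy = sum(1 for _ in pattern.finditer(big_text))

# Same answer, but every matched string is stored in a list first.
count_eager = len(pattern.findall(big_text))

print(count_lazy, count_eager)
```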

Sometimes it's superfluous to retrieve all matches. If the number of matches is very high, you risk filling up memory by loading them all.

Using iterators or generators is an important concept in modern Python. That being said, if you have a small text (e.g. this web page), the optimization is minuscule.

Here is a related question about iterators: Performance Advantages to Iterators?
