different behavior when using re.finditer and re.match

Question

I'm working on a regex to collect some values from a page with a script. I'm using re.match in a condition, but it evaluates to false; if I use finditer instead, the condition is true and its body is executed. I tested the regex in a tester I built myself and it works there, but not in the script. Here is a sample script.

import re

result = []
RE_Add0 = re.compile(r"\d{5}(?:(?:-| |)\d{4})?", re.IGNORECASE)
each = 'Expiration Date:\n05/31/1996\nBusiness Address: 23901 CALABASAS ROAD #2000 CALABASAS, CA 91302\n'
if RE_Add0.match(each):
    result0 = RE_Add0.match(each).group(0)
    print(result0)
    if len(result0) < 100:
        result.append(result0)
    else:
        print('Address ignore')
else:
    pass

Answer 1

re.finditer() returns an iterator object even if there is no match, so a test like if RE_Add0.finditer(each) is always true. You have to iterate over the object to see whether there are any actual matches.

Then, re.match() only matches at the beginning of the string, not anywhere in the string as re.search() or re.finditer() do.
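A minimal sketch of these two points, using the string from the question. The variable names are illustrative and not taken from the original script:

```python
import re

pattern = re.compile(r"\d{5}(?:(?:-| |)\d{4})?")
text = ("Expiration Date:\n05/31/1996\n"
        "Business Address: 23901 CALABASAS ROAD #2000 CALABASAS, CA 91302\n")

# finditer() always returns an iterator object, and that object is truthy
# even when the pattern matches nothing.
no_hits = re.finditer(r"xyz", text)
always_truthy = bool(no_hits)

# Iterating is the only way to learn whether anything matched.
zip_codes = [m.group(0) for m in pattern.finditer(text)]

# match() only tries position 0; search() scans the whole string.
at_start = pattern.match(text)
first_hit = pattern.search(text)
```

Here at_start is None because the string begins with "Expiration", while search() and finditer() both find the five-digit runs further in.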

Third, that regex could be written as r"\d{5}(?:[ -]?\d{4})?" .

Fourth, always use raw strings with regexes.
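To illustrate the last two points, here is a small sketch. It assumes the simplified pattern is meant with single backslashes and a trailing ? so that the +4 part stays optional; the sample strings are made up for the comparison:

```python
import re

# In a raw string the backslash reaches the regex engine untouched.
same_pattern = (r"\d{5}" == "\\d{5}")

# Without the r prefix some escapes are consumed by Python first:
# "\b" is a single backspace character, r"\b" is the two-character
# word-boundary escape.
backspace_len = len("\b")
boundary_len = len(r"\b")

original = re.compile(r"\d{5}(?:(?:-| |)\d{4})?")
simplified = re.compile(r"\d{5}(?:[ -]?\d{4})?")

samples = ["91302", "91302-1234", "91302 1234", "913021234", "CA 91302"]
pairs = [(original.search(s).group(0), simplified.search(s).group(0))
         for s in samples]
```

On these samples both patterns return the same text, which suggests the character class is a drop-in replacement for the (?:-| |) alternation.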

Answer 2

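re.match matches only at the beginning of a string, and only once. re.finditer is similar to re.search in that it scans the whole string, but it yields every non-overlapping match instead of just the first one. Note that finditer returns an iterator object whether or not anything matched. Compare: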

>>> re.match('a', 'abc')
<_sre.SRE_Match object at 0x01057AA0>
>>> re.match('b', 'abc')
>>> re.finditer('a', 'abc')
<callable_iterator object at 0x0106AD30>
>>> re.finditer('b', 'abc')
<callable_iterator object at 0x0106EA10>

ETA: Since you mention a page, I can only surmise that you're talking about HTML parsing. If that is the case, use BeautifulSoup or a similar HTML parser. Don't use regex.

Answer 3

Try this:

import re

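# Groups: 1 = whole match, 2 = five-digit code, 3 = separator,
# 4 = optional four-digit extension, 5 = trailing whitespace
postalCode = re.compile(r'((\d{5})([ -])?(\d{4})?(\s*))$')
# findall() returns one tuple of groups per match; index 1 is the five-digit code
primaryGroup = lambda x: x[1]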

sampleStr = """
    Expiration Date:
    05/31/1996
    Business Address: 23901 CALABASAS ROAD #2000 CALABASAS, CA 91302  
"""
result = []

matches = postalCode.findall(sampleStr)
if matches:
    for match in matches:
        pc = primaryGroup(match)
        print(pc)
        result.append(pc)
else:
    print("No postal code found in this string")

This returns '12345' on any of

12345\n
12345  \n
12345 6789\n
12345 6789    \n
12345 \n
12345     \n
12345-6789\n
12345-6789    \n
12345-\n
12345-    \n
123456789\n
123456789    \n
12345\n
12345    \n

I have it matching only at the end of the string (without re.MULTILINE, $ does not match at the end of every line), because otherwise it was also matching '23901' (from the street address) in your example.
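If the input holds several address lines, one option is the re.MULTILINE flag, which lets $ match before every newline rather than only at the end of the string. A sketch under that assumption; zip_at_eol, the simplified pattern, and the second address line are illustrative and not part of the answer above:

```python
import re

# Capture the five-digit code; allow an optional +4 part and trailing
# spaces or tabs, then require the end of a line.
zip_at_eol = re.compile(r"(\d{5})(?:[ -]?\d{4})?[ \t]*$", re.MULTILINE)

text = (
    "Business Address: 23901 CALABASAS ROAD #2000 CALABASAS, CA 91302\n"
    "Mailing Address: 100 MAIN STREET ANYTOWN, CA 90210-1234  \n"
)

# With a single capturing group, findall() returns a list of strings.
codes = zip_at_eol.findall(text)
print(codes)
```

The street number 23901 is skipped because it is not followed by the end of a line.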

3 answers

solution1
3 2011-01-10 12:51:56

solution2
1 2011-01-10 12:48:39

solution3
0 2011-01-10 14:55:16
