python re.findall returns a list of tuples (strings are expected)

Question

re.findall returns a list of tuples that contain the expected strings along with something unexpected.

I was writing a function findtags(text) to find tags in a given paragraph of text. When I call re.findall(tags, text) to find the tags in the text, it returns a list of tuples rather than a list of strings. Each tuple contains the string I expected, plus an extra element.

The function findtags(text) is as follows:

import re

def findtags(text):
    # Raw strings keep the regex backslashes from being read as string escapes.
    parms = r'(\w+\s*=\s*"[^"]*"\s*)*'
    tags = r'(<\s*\w+\s*' + parms + r'\s*/?>)'
    result = re.findall(tags, text)
    print(result)
    return result

testtext1 = """
My favorite website in the world is probably 
<a href="www.udacity.com">Udacity</a>. If you want 
that link to open in a <b>new tab</b> by default, you should
write <a href="www.udacity.com"target="_blank">Udacity</a>
instead!
"""

findtags(testtext1)

The expected result is

['<a href="www.udacity.com">', 
 '<b>', 
 '<a href="www.udacity.com"target="_blank">']

The actual result is

[('<a href="www.udacity.com">', 'href="www.udacity.com"'), 
 ('<b>', ''), 
 ('<a href="www.udacity.com"target="_blank">', 'target="_blank"')]

Answer 1

According to the docs for re.findall:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

In your case the pattern has two groups: the outer parentheses in tags and the parentheses in parms = '(\w+\s*=\s*"[^"]*"\s*)*'. That is why each result is a tuple. The inner group is repeated, so it holds only its last repetition, and it is an empty string when the tag has no attributes.
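As a small illustration of the rule quoted above, the sketch below runs re.findall with zero, one, and two groups. The sample string and the simplified patterns are made up for this example and are not the ones from the question.

```python
import re

sample = '<b> <a href="x">'

# No groups: each item is the whole match.
no_groups = re.findall(r'<\w+[^>]*>', sample)

# One group: each item is the text of that group.
one_group = re.findall(r'<(\w+)[^>]*>', sample)

# Two groups: each item is a tuple with one entry per group.
# A group that did not take part in the match shows up as ''.
two_groups = re.findall(r'<(\w+)\s*(\w+="[^"]*")?>', sample)

# A repeated group keeps only its last repetition.
repeated = re.findall(r'(\d,)*;', '1,2,3,;')

print(no_groups)   # ['<b>', '<a href="x">']
print(one_group)   # ['b', 'a']
print(two_groups)  # [('b', ''), ('a', 'href="x"')]
print(repeated)    # ['3,']
```

The last two lines mirror the actual result in the question: ('<b>', '') comes from the inner group not matching at all, and 'target="_blank"' is the last repetition of the repeated group.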

Answer 2

Looks like you don't want to return your inner capture group matches, so make it a non-capturing group instead.

parms = r'(?:\w+\s*=\s*"[^"]*"\s*)*'
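One way to check the effect of this change is to compare the number of capturing groups in the compiled pattern before and after. The sketch below uses the groups attribute of a compiled pattern; the helper build_tag_pattern and the one-tag sample are hypothetical names introduced only for this example.

```python
import re

capturing_parms = r'(\w+\s*=\s*"[^"]*"\s*)*'
non_capturing_parms = r'(?:\w+\s*=\s*"[^"]*"\s*)*'

def build_tag_pattern(parms):
    # Same outer pattern as in the question; only parms varies.
    return re.compile(r'(<\s*\w+\s*' + parms + r'\s*/?>)')

before = build_tag_pattern(capturing_parms)
after = build_tag_pattern(non_capturing_parms)

print(before.groups)  # 2 -> findall yields tuples
print(after.groups)   # 1 -> findall yields strings

sample = '<a href="x">'
print(before.findall(sample))  # [('<a href="x">', 'href="x"')]
print(after.findall(sample))   # ['<a href="x">']
```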

Answer 3

re.findall returns tuples because your pattern has two capturing groups. Make the parms group non-capturing by using ?: as follows:

import re

def findtags(text):
    # Make the inner group non-capturing so only the outer group is returned.
    parms = r'(?:\w+\s*=\s*"[^"]*"\s*)*'
    tags = r'(<\s*\w+\s*' + parms + r'\s*/?>)'
    result = re.findall(tags, text)
    print(result)
    return result

testtext1 = """
My favorite website in the world is probably 
<a href="www.udacity.com">Udacity</a>. If you want 
that link to open in a <b>new tab</b> by default, you should
write <a href="www.udacity.com"target="_blank">Udacity</a>
instead!
"""

findtags(testtext1)

Output:

['<a href="www.udacity.com">', '<b>', '<a href="www.udacity.com"target="_blank">']

Another way: if the pattern has no capturing group at all, re.findall returns the whole matched text:

# non-capturing group
parms = r'(?:\w+\s*=\s*"[^"]*"\s*)*'
# no group at all
tags = r'<\s*\w+\s*' + parms + r'\s*/?>'
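If the capturing groups have to stay in the pattern for some other reason, a possible alternative (not mentioned in the answers above) is re.finditer, since group(0) of a match object is always the full match. This is a minimal sketch; findtags_via_finditer and the shortened sample text are hypothetical names used only here.

```python
import re

# Original pattern from the question, with both capturing groups left in place.
TAG_PATTERN = re.compile(r'(<\s*\w+\s*(\w+\s*=\s*"[^"]*"\s*)*\s*/?>)')

def findtags_via_finditer(text):
    # group(0) is the whole match, no matter how many groups the pattern has.
    return [match.group(0) for match in TAG_PATTERN.finditer(text)]

sample = '<a href="www.udacity.com">Udacity</a> and <b>new tab</b>'
found = findtags_via_finditer(sample)
print(found)  # ['<a href="www.udacity.com">', '<b>']
```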

3 answers

solution1
0 2019-10-03 14:24:41

solution2
0 2019-10-03 14:24:48

solution3
0 ACCEPTED 2019-10-03 14:53:09
