
python re.findall returns a list of tuples (strings are expected)

re.findall returns a list of tuples that contain the expected strings along with something unexpected.

I was writing a function findtags(text) to find tags in a given paragraph of text. When I called re.findall(tags, text) to find the tags defined by my pattern, it returned a list of tuples. Each tuple contains the string I expected, plus an extra string.

The function findtags(text) is as follows:

import re

def findtags(text):
    # raw strings keep \w and \s from being read as string escape sequences
    parms = r'(\w+\s*=\s*"[^"]*"\s*)*'
    tags = r'(<\s*\w+\s*' + parms + r'\s*/?>)'
    print(re.findall(tags, text))
    return re.findall(tags, text)

testtext1 = """
My favorite website in the world is probably 
<a href="www.udacity.com">Udacity</a>. If you want 
that link to open in a <b>new tab</b> by default, you should
write <a href="www.udacity.com"target="_blank">Udacity</a>
instead!
"""

findtags(testtext1)

The expected result is

['<a href="www.udacity.com">', 
 '<b>', 
 '<a href="www.udacity.com"target="_blank">']

The actual result is

[('<a href="www.udacity.com">', 'href="www.udacity.com"'), 
 ('<b>', ''), 
 ('<a href="www.udacity.com"target="_blank">', 'target="_blank"')]

According to the docs for re.findall :

If one or more groups are present in the pattern, return a list of groups ; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

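In your case, the pattern contains two capturing groups: the outer parentheses in tags and the inner ones in parms = '(\w+\s*=\s*"[^"]*"\s*)*'. re.findall therefore returns one tuple per match, with one entry per group. The inner group is repeated, so it keeps only its last repetition, and it yields an empty string when the tag has no attributes.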
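To make the rule from the docs concrete, here is a small sketch of how the shape of the re.findall result changes with the number of capturing groups. The sample strings and variable names are made up for illustration and are not part of the question's code.

```python
import re

sample = 'x=1 y=2'

# No groups: each item is the whole match.
no_groups = re.findall(r'\w=\d', sample)

# One group: each item is that group's text.
one_group = re.findall(r'(\w)=\d', sample)

# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(\w)=(\d)', sample)

# A repeated group keeps only its last repetition, and a group that
# took no part in the match shows up as an empty string.
repeated = re.findall(r'(<\w+( \w+)*>)', '<a b c> <d>')

print(no_groups)   # ['x=1', 'y=2']
print(one_group)   # ['x', 'y']
print(two_groups)  # [('x', '1'), ('y', '2')]
print(repeated)    # [('<a b c>', ' c'), ('<d>', '')]
```

The last case mirrors the question: the outer group holds the whole tag, while the inner repeated group holds only the final attribute, or '' when there is none.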

Looks like you don't want to return your inner capture group matches, so make it a non-capturing group instead.

parms = r'(?:\w+\s*=\s*"[^"]*"\s*)*'

re.findall returns tuples because the pattern has two capturing groups. Make the parms group non-capturing by starting it with ?: so that only the outer group is captured:

import re

def findtags(text):
    # make the attribute group non-capturing with ?:
    parms = r'(?:\w+\s*=\s*"[^"]*"\s*)*'
    tags = r'(<\s*\w+\s*' + parms + r'\s*/?>)'
    print(re.findall(tags, text))
    return re.findall(tags, text)

testtext1 = """
My favorite website in the world is probably 
<a href="www.udacity.com">Udacity</a>. If you want 
that link to open in a <b>new tab</b> by default, you should
write <a href="www.udacity.com"target="_blank">Udacity</a>
instead!
"""

findtags(testtext1)

OUTPUT:

['<a href="www.udacity.com">', '<b>', '<a href="www.udacity.com"target="_blank">']

Another way: if the pattern has no capturing group at all, re.findall returns the full matched text:

# non-capturing group
parms = r'(?:\w+\s*=\s*"[^"]*"\s*)*'
# no capturing group at all
tags = r'<\s*\w+\s*' + parms + r'\s*/?>'
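If you would rather keep the capturing groups in the pattern, one more option (not mentioned in the answers above) is re.finditer, which yields match objects so you can ask for the whole match with group(0). The sketch below assumes the question's pattern with the inner group left capturing; the names PARMS, TAG_RE and find_tags_whole are hypothetical.

```python
import re

# Same pattern as in the question; the inner group is left capturing on purpose.
PARMS = r'(\w+\s*=\s*"[^"]*"\s*)*'
TAG_RE = re.compile(r'<\s*\w+\s*' + PARMS + r'\s*/?>')

def find_tags_whole(text):
    """Return the full text of every tag match, ignoring capture groups."""
    return [m.group(0) for m in TAG_RE.finditer(text)]

result = find_tags_whole('<a href="x">link</a> and <b>bold</b>')
print(result)  # ['<a href="x">', '<b>']
```

group(0) is always the entire match, so the result does not depend on how many groups the pattern happens to contain.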
