re.findall returns a list of tuples containing the expected strings and also something unexpected
I was writing a function findtags(text) to find tags in a given paragraph of text. When I called re.findall(tags, text) to find the defined tags in the text, it returned a list of tuples. Each tuple in the list contains the string that I expected, plus an extra string that I did not expect.
The function findtags(text) is as follows:
import re

def findtags(text):
    parms = r'(\w+\s*=\s*"[^"]*"\s*)*'
    tags = r'(<\s*\w+\s*' + parms + r'\s*/?>)'
    print(re.findall(tags, text))
    return re.findall(tags, text)

testtext1 = """
My favorite website in the world is probably
<a href="www.udacity.com">Udacity</a>. If you want
that link to open in a <b>new tab</b> by default, you should
write <a href="www.udacity.com"target="_blank">Udacity</a>
instead!
"""

findtags(testtext1)
The expected result is:
['<a href="www.udacity.com">',
'<b>',
'<a href="www.udacity.com"target="_blank">']
The actual result is:
[('<a href="www.udacity.com">', 'href="www.udacity.com"'),
('<b>', ''),
('<a href="www.udacity.com"target="_blank">', 'target="_blank"')]
According to the docs for re.findall:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
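In your case, the parenthesized part of parms, (\w+\s*=\s*"[^"]*"\s*)*, is a second capturing group in addition to the outer group in tags, so findall returns a list of 2-tuples. Because that inner group is repeated with *, it may match nothing (giving an empty string, as for <b>), and when it repeats it keeps only its last repetition, which is why the third tuple shows only target="_blank".
It looks like you don't want the inner capture group's matches returned, so make it a non-capturing group instead:
parms = r'(?:\w+\s*=\s*"[^"]*"\s*)*'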
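To make the rule from the docs concrete, here is a minimal sketch of how the number of capturing groups changes what re.findall returns. The sample string and the variable names are made up for illustration and are not part of the original question.

```python
import re

# Hypothetical sample input, chosen only to keep the patterns short.
text = 'a=1 b=2'

# No groups: findall returns the full text of each match.
no_groups = re.findall(r'\w=\d', text)          # ['a=1', 'b=2']

# One capturing group: findall returns that group's text only.
one_group = re.findall(r'(\w)=\d', text)        # ['a', 'b']

# Two capturing groups: findall returns a list of tuples.
two_groups = re.findall(r'(\w)=(\d)', text)     # [('a', '1'), ('b', '2')]

# A non-capturing group (?:...) does not count as a group,
# so findall goes back to returning full matches.
non_capturing = re.findall(r'(?:\w)=\d', text)  # ['a=1', 'b=2']

# A repeated capturing group keeps only its last repetition.
last_repeat = re.match(r'(\w=\d\s*)+', text).group(1)  # 'b=2'

print(no_groups, one_group, two_groups, non_capturing, last_repeat)
```

The last line mirrors what happens with the second <a> tag in the question: the repeated group matched both attributes, but only the final one is reported.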
re.findall returns tuples because the pattern has two capturing groups. Just make the parms group a non-capturing one by using ?: :
import re

def findtags(text):
    # make this a non-capturing group
    parms = r'(?:\w+\s*=\s*"[^"]*"\s*)*'
    tags = r'(<\s*\w+\s*' + parms + r'\s*/?>)'
    print(re.findall(tags, text))
    return re.findall(tags, text)

testtext1 = """
My favorite website in the world is probably
<a href="www.udacity.com">Udacity</a>. If you want
that link to open in a <b>new tab</b> by default, you should
write <a href="www.udacity.com"target="_blank">Udacity</a>
instead!
"""

findtags(testtext1)
Output:
['<a href="www.udacity.com">', '<b>', '<a href="www.udacity.com"target="_blank">']
Another way: if the pattern contains no capturing group at all, re.findall returns the full matched text:
# non-capturing group
parms = r'(?:\w+\s*=\s*"[^"]*"\s*)*'
# no group at all
tags = r'<\s*\w+\s*' + parms + r'\s*/?>'
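If you would rather leave the pattern's groups untouched, one possible alternative (a sketch, not something proposed in the answers above) is re.finditer, which yields match objects; group(0) on a match object is always the whole match, no matter how many capturing groups the pattern has. The short sample string below is an assumption made for illustration.

```python
import re

def findtags(text):
    # Same pattern as in the question, capturing groups left as they were.
    parms = r'(\w+\s*=\s*"[^"]*"\s*)*'
    tags = r'(<\s*\w+\s*' + parms + r'\s*/?>)'
    # group(0) is the entire match, independent of any capturing groups.
    return [m.group(0) for m in re.finditer(tags, text)]

# Hypothetical sample input for illustration.
sample = '<a href="x">A</a> and <b>bold</b>'
found = findtags(sample)
print(found)  # ['<a href="x">', '<b>']
```

This keeps the inner group available through m.group(2) if the last attribute is ever needed, at the cost of a slightly longer return expression.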