The regular expression below
import re
get_tags = lambda t: re.findall(r"<(.+)>", t)
st = "xyx<ab>xy x<bc> xyx<cd>xyxy xx<de> xyx <ef>x y<fg><gh>y"
print(get_tags(st))
The expected output was
['ab', 'bc', 'cd', 'de', 'ef', 'fg', 'gh']
Even though the pattern does not look greedy to me (no '*' is used), the expression gives this output instead:
['ab>xy x<bc> xyx<cd>xyxy xx<de> xyx <ef>x y<fg><gh']
What is the problem with the pattern?
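.+ is greedy by default (greediness applies to + as well as *), so it matches as much as possible and runs from the first < to the last >. Add ? (the reluctant quantifier) right after the + to make the match non-greedy: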
get_tags = lambda t: re.findall(r"<(.+?)>", t)
OR
get_tags = lambda t: re.findall(r"<([^<>]+)>", t)
[^<>]+ is a negated character class that matches any character other than < or >, one or more times.
>>> import re
>>> get_tags = lambda t: re.findall(r"<(.+?)>", t)
>>> st = "xyx<ab>xy x<bc> xyx<cd>xyxy xx<de> xyx <ef>x y<fg><gh>y"
>>> print(get_tags(st))
['ab', 'bc', 'cd', 'de', 'ef', 'fg', 'gh']
>>> get_tags = lambda t: re.findall(r"<([^<>]+)>", t)
>>> print(get_tags(st))
['ab', 'bc', 'cd', 'de', 'ef', 'fg', 'gh']
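The two patterns agree on this input, but they can differ when the brackets are not well-formed. A minimal sketch, assuming a made-up string with a stray < in it (the names sample, lazy and negated are only for illustration):

```python
import re

# Hypothetical input with a stray "<" before the real tag.
sample = "x<a<b>y"

# Lazy quantifier: starts at the first "<" and stops at the first ">" it can reach.
lazy = re.findall(r"<(.+?)>", sample)

# Negated class: cannot cross another "<", so it only matches the inner tag.
negated = re.findall(r"<([^<>]+)>", sample)

print(lazy)     # ['a<b']
print(negated)  # ['b']
```

If the input may contain unbalanced brackets, the negated character class is usually the safer choice of the two.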
Since you only need to find letters between < and >, you could also use
get_tags = lambda t: re.findall(r"<(\w+)>", t)
as the regex. \w+ matches only word characters ([A-Za-z0-9_], plus Unicode word characters in Python 3) between < and >. Since there are no spaces or other characters between the brackets in your example, this would also work.
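To see where \w+ stops being equivalent to the other patterns, here is a small sketch on a made-up string (sample, word_only and anything are hypothetical names, not part of the question). Note that \w also accepts digits and underscores, but rejects spaces and hyphens:

```python
import re

# Hypothetical sample; only some bracketed parts consist purely of word characters.
sample = "<a_1> <b c> <d-e> <fg>"

# \w+ accepts letters, digits and underscores, so "b c" and "d-e" are skipped.
word_only = re.findall(r"<(\w+)>", sample)

# The negated class accepts anything except the brackets themselves.
anything = re.findall(r"<([^<>]+)>", sample)

print(word_only)  # ['a_1', 'fg']
print(anything)   # ['a_1', 'b c', 'd-e', 'fg']
```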