I am using <[^<>]+> to extract substrings between < and >, such as the following: <abc>, <?.sdfs/>, <sdsld\\>, etc.
I am not trying to parse HTML tags or anything similar. My only goal is to extract the strings between < and >.
But sometimes, there might be substrings like the following:
</</\/\asa></dsdsds><sdsfsa>>
In that case, the whole string should be matched instead of 3 substrings, because the whole string is enclosed by < and >.
How can I modify my regex to do that?
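Don't use regex. Do this the traditional way: keep a stack of characters and a counter of open '<'. While more than one '<' is open, keep appending characters (inner brackets included); when the outermost '>' closes, join the stack and append the whole thing to the result.
Just make sure to handle the double backslashes that somehow crop up :-/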
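def find_tags(your_string):
    ans = []      # outermost <...> contents found so far
    stack = []    # characters of the span currently being read
    tag_no = 0    # current nesting depth of '<'
    for c in your_string:
        if c == '<':
            tag_no += 1
            if tag_no > 1:
                # nested '<' is part of the outer span
                stack.append(c)
        elif c == '>':
            if tag_no == 1:
                # outermost '>' closes the span
                ans.append(''.join(stack))
                tag_no = 0
                stack = []
            elif tag_no > 1:
                # nested '>' is part of the outer span;
                # a stray '>' outside any '<' (tag_no == 0) is ignored
                tag_no = tag_no - 1
                stack.append(c)
        elif tag_no > 0:
            stack.append(c)
    return ans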
Output:
>>> find_tags(r'<abc>, <?.sdfs/>, <sdsld\>')
['abc', '?.sdfs/', 'sdsld\\']
>>> find_tags(r'</</\/\asa></dsdsds><sdsfsa>>')
['/</\\/\\asa></dsdsds><sdsfsa>']
Note: this also runs in O(n) time, since it makes a single pass over the string.
Refer to this question: "Regular Expression to match outer brackets". I'm trying to implement the same thing using < and >.
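If a regex is still preferred, note that Python's standard `re` module has no recursive patterns, so it cannot match arbitrarily deep nesting on its own. A rough workaround, sketched below, is to build a pattern that accepts nesting up to a fixed depth. The helper name `build_nested_pattern` and the `max_depth` parameter are made up for this illustration, and the depth limit is an assumption about the input rather than anything stated in the question. Unlike the original `<[^<>]+>`, this sketch also accepts an empty `<>`.

```python
import re

def build_nested_pattern(max_depth):
    """Compile a regex matching <...> with up to max_depth levels of nested <...> inside.

    Hypothetical helper for illustration only; input nested deeper than
    max_depth will not be matched as one outer span.
    """
    # Depth 0: the content may not contain any brackets.
    inner = r'[^<>]*'
    for _ in range(max_depth):
        # Each pass allows one more level: plain characters or a bracketed group.
        inner = r'(?:[^<>]|<' + inner + r'>)*'
    # Capture the content between the outermost brackets.
    return re.compile('<(' + inner + ')>')

pattern = build_nested_pattern(3)
print(pattern.findall(r'<abc>, <?.sdfs/>, <sdsld\>'))
print(pattern.findall(r'</</\/\asa></dsdsds><sdsfsa>>'))
```

The pattern grows with `max_depth`, so the stack/counter approach is probably the better fit when the nesting depth is unknown.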
Or how about a small method for this:
def recursive_bracket_parser(s, i):
    while i < len(s):
        if s[i] == '<':
            i = recursive_bracket_parser(s, i+1)
        elif s[i] == '>':
            return i+1
        else:
            # process whatever is at s[i]
            i += 1
    return i
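The parser above only walks the string; it does not collect anything. One possible way to use it to extract the outermost spans is sketched below. The driver `outer_spans` is a hypothetical name introduced here, and the sketch assumes the brackets in the input are balanced: with an unclosed '<' the slice boundaries would be wrong.

```python
def recursive_bracket_parser(s, i):
    # Returns the index just past the '>' that closes the bracket opened before i.
    while i < len(s):
        if s[i] == '<':
            i = recursive_bracket_parser(s, i + 1)
        elif s[i] == '>':
            return i + 1
        else:
            i += 1
    return i

def outer_spans(s):
    """Collect the text inside each outermost <...> (assumes balanced brackets)."""
    result = []
    i = 0
    while i < len(s):
        if s[i] == '<':
            end = recursive_bracket_parser(s, i + 1)
            # s[i] is the opening '<' and s[end - 1] is its matching '>'.
            result.append(s[i + 1:end - 1])
            i = end
        else:
            i += 1
    return result

print(outer_spans(r'<abc>, <?.sdfs/>, <sdsld\>'))
print(outer_spans(r'</</\/\asa></dsdsds><sdsfsa>>'))
```

The recursion depth equals the nesting depth of the brackets, so very deeply nested input could hit Python's recursion limit; the iterative counter version avoids that.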