I have a string like "F(230,24)F[f(22)_(23);(2)%[+(45)FF]]"
, where each character except for parentheses and what they enclose represents a kind of instruction. A character can be followed by an optional list of arguments specified in optional parentheses.
I would like to split such a string into ['F(230,24)', 'F', '[', 'f(22)', '_(23)', ';(2)', '%', '[', '+(45)', 'F', 'F', ']', ']']
, however at the moment I only get ['F(230,24)', 'F', '[', 'f(22)_(23);(2)', '%', '[', '+(45)', 'F', 'F', ']', ']']
(the substring 'f(22)_(23);(2)' was not split correctly).
Currently I am using list(filter(None, re.split(r'([A-Za-z\[\]\+\-\^\&\\\/%_;~](?!\())', string)))
, which is just a character class listing the instruction characters, followed by a negative lookahead for (
. list(filter(None, <list>))
is used to remove empty strings from the result.
I am aware that this is likely caused by Python's re.split
having been designed not to split on a zero-length match, as discussed here. However, I was wondering what a good solution would be. Is there a better way than re.findall
?
Thank you.
EDIT: Unfortunately, I am not allowed to use custom packages such as the regex
module.
"I am aware that this is likely caused by Python's re.split having been designed not to split on a zero length match"
You can use the VERSION1
flag of the regex
module. Taking the example from the thread you linked, see how split()
also splits on zero-width matches:
>>> import regex as re
>>> re.split(r"\s+|\b", "Split along words, preserve punctuation!", flags=re.V1)
['', 'Split', 'along', 'words', ',', 'preserve', 'punctuation', '!']
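Since the question rules out the regex module, it may be worth noting that the standard library's re.split also splits on empty matches from Python 3.7 onward. The sketch below assumes Python 3.7 or later and assumes that instruction characters never appear inside the parentheses; the names BOUNDARY and split_instructions are made up for illustration.

```python
import re

# Zero-width pattern: any position except the very start of the string
# that is immediately followed by an instruction character.
# Assumption: arguments inside parentheses never contain these characters.
BOUNDARY = re.compile(r"(?!^)(?=[A-Za-z\[\]+\-^&\\/%_;~])")

def split_instructions(s):
    """Split s before every instruction character (Python 3.7+ only)."""
    return BOUNDARY.split(s)

tokens = split_instructions("F(230,24)F[f(22)_(23);(2)%[+(45)FF]]")
print(tokens)
```

Because the pattern consumes no characters, no empty strings are produced for this input and the list(filter(None, ...)) wrapper is not needed. On older Python versions re.split ignores empty matches, so this approach would not work there.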
You can use re.findall
to find every single character that is optionally followed by a pair of parentheses:
import re
s = "F(230,24)F[f(22)_(23);(2)%[+(45)FF]]"
re.findall(r"[^()](?:\([^()]*\))?", s)
['F(230,24)',
'F',
'[',
'f(22)',
'_(23)',
';(2)',
'%',
'[',
'+(45)',
'F',
'F',
']',
']']
[^()]
matches a single character other than a parenthesis; (?:\([^()]*\))?
is a non-capturing group ( ?:
) that matches an argument list enclosed in a pair of parentheses, and the trailing ?
makes the group optional.
Another solution: this time the pattern recognizes strings with the structure SYMBOL[(NUMBER[,NUMBER...])] . The function parse_it
returns True and the tokens if the string matches the regular expression, and False and an empty string if it does not.
import re

def parse_it(string):
    '''
    Input: String to parse
    Output: True|False, Tokens|empty_string
    '''
    # One instruction symbol, optionally followed by (NUMBER[,NUMBER...])
    pattern = re.compile(r'[A-Za-z\[\]\+\-\^\&\\\/%_;~](?:\(\d+(?:,\d+)*\))?')
    tokens = pattern.findall(string)
    # findall silently skips text it cannot match, so the tokens only
    # rebuild the input when every character belongs to a valid token.
    if ''.join(tokens) == string:
        res = (True, tokens)
    else:
        res = (False, '')
    return res

good_string = 'F(230,24)F[f(22)_(23);(2)%[+(45)FF]]'
bad_string = 'F(2a30,24)F[f(22)_(23);(2)%[+(45)FF]]'  # There is an 'a' in a bad place.

print(parse_it(good_string))
print(parse_it(bad_string))
Output:
(True, ['F(230,24)', 'F', '[', 'f(22)', '_(23)', ';(2)', '%', '[', '+(45)', 'F', 'F', ']', ']'])
(False, '')
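If the arguments are needed separately from the instruction character, the findall patterns above could be extended with named groups and re.finditer. This is only a sketch: the names TOKEN and tokenize and the group names symbol and args are made up for illustration, and the arguments are kept as strings rather than converted to numbers.

```python
import re

# One non-parenthesis character, optionally followed by "(...)".
# The named groups are an illustrative choice, not part of the answers above.
TOKEN = re.compile(r"(?P<symbol>[^()])(?:\((?P<args>[^()]*)\))?")

def tokenize(s):
    """Return a list of (symbol, args) pairs; args is a list of strings."""
    result = []
    for m in TOKEN.finditer(s):
        args = m.group("args")
        # args is None without parentheses and "" for "()": both give [].
        result.append((m.group("symbol"), args.split(",") if args else []))
    return result

print(tokenize("F(230,24)F[f(22)"))
# [('F', ['230', '24']), ('F', []), ('[', []), ('f', ['22'])]
```

Like the plain findall version, this does not validate the input; a check such as the one in parse_it would still be needed to reject malformed strings.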