
Python regular expression splits some strings only in certain orders

I have the following tokenizeAndParse(s) function, which takes a string and attempts to tokenize it into a list of strings:

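import re

def tokenizeAndParse(s):
    # Split on the delimiters; the capturing group keeps them in the result.
    tokens = re.split(r"(\s+|assign|:=|print|\+|if|while|{|}|;|[|]|,|@|for|true|false|call|procedure|not|and|or|\(|\))", s)
    # Drop whitespace-only and empty strings.
    tokens = [t for t in tokens if not t.isspace() and not t == ""]
    print("hello",tokens)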

Some examples of the function:

tokenizeAndParse("assign abc := [true, true, true];")
hello ['assign', 'abc', ':=', '[', 'true', ',', 'true', ',', 'true', ']', ';']

tokenizeAndParse("print 5+5;")
hello ['print', '5', '+', '5', ';']

I am running into an interesting problem: if I call the following, the 4 and the ] aren't split into separate tokens, and I have no idea why. As demonstrated above, if true comes before the ], the function works fine.

 tokenizeAndParse("assign abc := [true, true, 4];")
 hello ['assign', 'abc', ':=', '[', 'true', ',', 'true', ',', '4]', ';']

Further experimenting with the function shows that whenever a number comes directly before the ], the two are not split correctly. What is going on here?

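The pattern never splits on numbers, and it never splits on the brackets either: the unescaped [|] is a character class that matches only a literal |. In the first example the [ and ] come out as separate tokens only because each sits next to another delimiter (whitespace, true, or ;). In 4] nothing separates the two characters, so they stay together. One fix is to replace this line: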

tokens = re.split(r"(\s+|assign|:=|print|\+|if|while|{|}|;|[|]|,|@|for|true|false|call|procedure|not|and|or|\(|\))", s)

with the version below, which adds [0-9]+ to the alternation:

>>> import re
>>> def tokenizeAndParse(s):
...     tokens = re.split(r"(\s+|assign|:=|print|\+|if|while|{|}|;|[|]|,|@|for|true|false|call|procedure|not|and|or|\(|\)|[0-9]+)", s)
...     tokens = [t for t in tokens if not t.isspace() and not t == ""]
...     print("hello",tokens)
...

>>> tokenizeAndParse("assign abc := [true, true, 4];")
hello ['assign', 'abc', ':=', '[', 'true', ',', 'true', ',', '4', ']', ';']

This fixes the case in the question, where a number immediately precedes the ].
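Splitting on digits does not make [ and ] delimiters in their own right, so an input such as [x] would presumably still come back as the single token '[x]'. A possible alternative, sketched below, is to escape the brackets in the pattern. The names TOKEN_PATTERN and tokenize are made up for this illustration and are not from the original post; the function also returns the list rather than printing it, to make it easier to check.

```python
import re

# Same alternation as in the question, except that the brackets are escaped
# (\[ and \]) so they are literal delimiters instead of the character class [|].
# TOKEN_PATTERN and tokenize are hypothetical names used only for this sketch.
TOKEN_PATTERN = re.compile(
    r"(\s+|assign|:=|print|\+|if|while|{|}|;|\[|\]|,|@|for|true|false"
    r"|call|procedure|not|and|or|\(|\))"
)


def tokenize(s):
    """Split s on the delimiters above, dropping whitespace and empty strings."""
    return [t for t in TOKEN_PATTERN.split(s) if t and not t.isspace()]


# The unescaped [|] from the original pattern matches only a literal "|".
print(re.findall(r"[|]", "[a|b]"))
# ['|']

print(tokenize("assign abc := [true, true, 4];"))
# ['assign', 'abc', ':=', '[', 'true', ',', 'true', ',', '4', ']', ';']

print(tokenize("assign abc := [x];"))
# ['assign', 'abc', ':=', '[', 'x', ']', ';']
```

With the brackets escaped, the number no longer needs to be a delimiter: 4 is simply the text left between the , and the ].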
