简体   繁体   中英

Python: parser for syntax checking of a string format

I'm starting with Python and for a design I've to validate a string that must have this format:

aaa...a aaa...a(bbb...b) aaa...a(bbb...b)ccc...c aaa...a(bbb...b)ccc...c(ddd...d)

where aaa..a, bbb...b, ccc..c, ddd..d are integer number.

The length of the string shall be arbitrary.

There are no spaces into the string.

Only single bracket allowed.

I've approached the problem as a finite state machine with two states.

I like to know if there is a best approach to solve this task and your impressions about it and every your hint.

Just as side information I've did some test by means of regexp but this seems to me a recursive pattern validation issue and I'm not sure that can be easily do in Python, but I'm not a expert into regexp, but I suppose that if this task should be possible may be executed with one single row of code.

Tha main advantage that I can see with the fsm approach is to notify to the user where a error is located into the input string and then make more easy (from the user perspective) the checking and correction task.

[EDIT] I've discovered a wrong detecting behaviour and now the code was corrected, are not allowed two consecutive group of brackt eg 10(200)(300). Also I've reformatted the code as a function.

"""

String parser for string formatted as reported below:

aaa...a
aaa...a(bbb...b)
aaa...a(bbb...b)ccc...c(ddd...d)

where:
aaa...a, bbb...b = integer number

Not valid (some example)
()
(aaa...a)
aaa...a()
aaa...a(bbb...b)ccc...d
aaa...a((bbb....b))
"""

import sys
import re

def parse_string(buffer):
    # Checking loop
    state = 1
    old_state = 1
    next_state = 1
    strlen = len(buffer)
    initial = True
    success = False
    is_a_number = re.compile("[0-9]")
    for index, i in enumerate(buffer):

        car = i

        # State 1
        if (state == 1):
            if is_a_number.match(car):
                if (index != strlen-1):
                    # If is a number e not the last I've to wait for the next char "(" or number
                    next_state = 1
                else:
                    if (initial):
                    # If is a number and is also the last of the initial block -> I've finish to parse
                        success = True
                        break
                    else:
                        # Is the last number but not into the initial block of numbers -> error
                        success = False
                        break
            else:
                if (car == "("):
                    if (old_state == 2):
                        # Can't have two (...)(...) consecutively
                        success = False
                        break
                    if ((index == 0) or (index == strlen-1)):
                        # The ( can't be the first or the last char
                        success = False
                        break
                    else:
                        # Step to the next state
                        next_state = 2
                        initial = False
                else:
                    # Wrong char detected
                    success = False
                    break

        if (state == 2):
            if is_a_number.match(car):
                if (index != strlen-1):
                    # The char is a number and is not the last of the string
                    next_state = 2
                else:
                    # If is a number and is also the last I've a error due to a missing ")"
                    success = False
                    break
            else:
                if (car == ")"):
                    if (old_state == 1):
                        # The sequence () is not allowed
                        success = False
                        break
                    elif ((old_state == 2) and (index != strlen-1)):
                        # The previous char was a number
                        next_state = 1
                    else:
                        # I'm on the last char of the string
                        success = True
                        break
                else:
                    # Wrong char detected
                    success = False
                    break

        print("current state: "+ str(state) + " next_state: " + str(next_state))

        # Update the old and the new state
        old_state = state
        state = next_state

    return(success, state, index)

if __name__ == "__main__":

    # Get the string from the command line
    # The first argument (index = 0) is the script name, the supplied parameters start from the idex = 1
    number_cmd = len(sys.argv) - 1
    if (number_cmd != 1):
        print ("Error: request one string as input!")
        sys.exit(0)

    # Get the string
    buffer = sys.argv[1].strip()

    print("================================")
    print("Parsing: " + buffer)
    print("Checking with fsm")
    print("--------------------------------")

    # Parse the string
    success, state, index = parse_string(buffer)

    # Check result
    if (success):
        print("String validated!")
        print("================================")
    else:
        print("Syntax error detected in state: " + str(state) + "\n" + "position: " + str(buffer[:index+1]))
        print("================================")

    # Exit from script
    sys.exit(0)

Finite state machines and regular expressions are equivalent in expressive power. They both can be used to parse regular languages . So if your problem can be solved with a FSM, it can also be solved with a regular expression.

If recursive parentheses are allowed, like 1(123(345)12) , then it is not a regular language, neither FSM nor regex can parse the string. But from your description and script, I guess recursive parentheses are not allowed. Regular expression can work.

Your requirements:

  1. Parse the string and return whether the string is valid.
  2. If the string is invalid, print the location of error.
  3. The string can not start with '(', and empty parentheses '()' are not allowed.

To get the precise location of error, you cannot use one regex to match the whole string. You can use regex \\(|\\) to split the string, and [0-9]+ to match each segment. Then, you only need to make sure the parentheses match.

Here is my script:

import re

def parse_input(s):
        s = s.strip()
        digits = re.compile("[0-9]+")
        segments = re.split("(\(|\))",s)

        if not segments:
                print "Error: blank input"
                return False
        if not segments[0]: # opens with parentheses
                print "Error: cannot open with parenthese"
                return False

        in_p = False

        def get_error_context(i):
                prefix = segments[i-1] if i>0 else ""
                suffix = segments[i+1] if i<len(segments)-1 else ""
                return prefix + segments[i] + suffix

        for i, segment in enumerate(segments):
                if not segment: # blank is not allowed within parentheses
                        if in_p:
                                print "Error: empty parentheses not allowed, around '%s'"%get_error_context(i)
                                return False
                        else:
                                print "Error: no digits between ) and (, around '%s'"%get_error_context(i)
                                return False
                elif segment == "(":
                        if in_p:
                                print "Error: recursive () not allowed, around '%s'"%get_error_context(i)
                                return False
                        else:
                                in_p = True
                elif segment == ")":
                        if in_p:
                                in_p = False
                        else:
                                print "Error: ) with no matching (, around '%s'"%get_error_context(i)
                                return False
                elif not digits.match(segment):
                        print "Error: non digits, around '%s'"%get_error_context(i)
                        return False
        if in_p:
                print "Error: input ends with an open parenthese, around '%s'"%get_error_context(i)
                return False
        return True

And tests:

>>> parse_input("12(345435)4332(34)")
True
>>> parse_input("(345435)4332(34)")
Error: cannot open with parenthese
False
>>> parse_input("sdf(345435)4332()")
Error: non digits, around 'sdf('
False
>>> parse_input("123(345435)4332()")
Error: empty parentheses not allowed, around '()'
False
>>> parse_input("34324(345435)(34)")
Error: no digits between ) and (, around ')('
False
>>> parse_input("123(344332()")
Error: recursive () not allowed, around '344332('
False
>>> parse_input("12)3(3443)32(123")
Error: ) with no matching (, around '12)3'
False
>>> parse_input("123(3443)32(123")
Error: input ends with an open parenthese, around '(123'
False

This can be done with regular expression. Here is an example in python which you also can try out on regex101 :

Regular Expression Pattern: (\\d+)(\\(\\d+\\)(\\d+(\\(\\d+\\))?)?)?

And this would be the python code:

import re
p = re.compile(ur'(\d+)(\(\d+\)(\d+(\(\d+\))?)?)?')
test_str = u"1000(20)30(345)"

re.match(p, test_str)

If you want to make check for example following input 1000(20)30(345) you can add a ^ before the regex and a $ at the end.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM