
Regular Expression to match all non alphanumeric characters except “<--”

I am trying to create a toy language in Python, but I am having difficulty creating my lexer. Variables in my language are assigned with an arrow ("<--"), so I want to create a token for this. I also want to be able to write floating-point numbers (3.2, 1.0, etc.).

The following is currently how I am splitting the source code of my language but this splits on every non-alphanumeric character.

    word_list = re.split(r'(\W)', self.source_code)
    word_list = [elem for elem in word_list if (elem != '' and elem != ' ')]

For example:

Hi <-- 1.2

Goes to:

['Hi', '<', '-', '-', '1', '.', '2', '\n']

But I want it to split to

['Hi', '<--', '1.2']

I was wondering if it is possible to split on every non-alphanumeric character except when there is a "<--" or a floating-point number, while still splitting when "<", "-", or "." appear on their own.

Edit

A more complex example:

DECLARE Number: INTEGER
DECLARE Hi: REAL
Hi <-- 1.2
INPUT Number

IF Number + Hi > 3
    THEN
        OUTPUT "Hello"

Should go to

['DECLARE', 'Number', ':', 'INTEGER', '\n', 'DECLARE', 'Hi', ':', 'REAL', '\n', 'Hi', '<--', '1.2', '\n', 'INPUT', 'Number', '\n', '\n', 'IF', 'Number', '+', 'Hi', '>', '3', '\n', '\t', 'THEN', '\n', '\t', '\t', 'OUTPUT', '"Hello"']

Normally, lexers for programming languages have you define a regular expression for each type of token in the language you are trying to parse. The lexer generator then builds a finite-state automaton that reads the input as a string of characters and transitions through states as it recognizes the various tokens. So the approach I have taken is to define what each token looks like and try to match that in your input. To be honest, I have not spent much effort on coming up with the best regular expressions for numbers and identifiers (I don't even know what your rules are). I merely wanted to show the approach I believe you should take:

import re

s = 'Hi  <--  (1.2)'
word_list = re.findall(r'(\b\d+(?:\.\d*)?\b|\b\.\d+\b|\b\w+\b|<--|\s+|\W+)', s)
print(word_list)

Prints:

['Hi', '  ', '<--', '  ', '(', '1.2', ')']

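The regex r'(\b\d+(?:\.\d*)?\b|\b\.\d+\b|\b\w+\b|<--|\s+|\W+)' tries the following alternatives in order:

  1. \b\d+(?:\.\d*)?\b Matches 123 and 123.45 on word boundaries. A trailing-dot form such as 123. is matched in full only when a word character follows the dot; otherwise only 123 is matched.
  2. \b\.\d+\b Intended to match .45. Note that the leading \b requires a word character immediately before the dot, so a standalone .45 falls through to items 6 and 1 as . and 45.
  3. \b\w+\b Matches Hi on a word boundary
  4. <-- Matches <-- (no boundary conditions; maybe this is not what you want)
  5. \s+ Matches contiguous whitespace
  6. \W+ Matches everything else not matched above (other operators)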

You need to make sure that everything is ultimately matched. You may then need to look at what has been matched (see item 6 above) to see if it is something that is actually a legal token.
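One way to check that nothing is silently dropped, as a rough sketch: because the alternatives together cover every character, joining the matches should reproduce the input. The helper name `covers_input` is made up for this illustration.

```python
import re

# The pattern from the answer above: one capturing group wrapping all alternatives.
TOKEN_RE = re.compile(r'(\b\d+(?:\.\d*)?\b|\b\.\d+\b|\b\w+\b|<--|\s+|\W+)')

def covers_input(source):
    """Return True if the matches, joined together, reproduce the input exactly."""
    return ''.join(TOKEN_RE.findall(source)) == source

print(covers_input('Hi  <--  (1.2)'))   # True
# A standalone leading-dot number is split, because \b cannot match before the dot.
print(TOKEN_RE.findall(' .45 '))        # [' ', '.', '45', ' ']
```

A check like this only confirms that every character ended up in some match; deciding whether each match is a legal token is still a separate step.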

So, if, for example, <-- is the only valid operator, the following would be one approach:

import re

s = 'Hi  <--  (1.2)'
for m in re.finditer(r'(?P<OK>\b\d+(?:\.\d*)?\b|\b\.\d+\b|\b\w+\b|<--)|(?P<IGNORE>\s+)|(?P<ERROR>\W+)', s):
    if m.group('ERROR') is not None:
        print('Unrecognized token', m.group('ERROR'))
    elif m.group('OK') is not None:
        print(m.group('OK'))

The various matches are tagged OK, IGNORE (white space), or ERROR. The above prints:

Hi
<--
Unrecognized token (
1.2
Unrecognized token )
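The "one regular expression per token type" idea can be taken a step further by giving each type its own named group and reading `m.lastgroup`. The following is only a sketch: the token names (NUMBER, ASSIGN, and so on), the operator set, and the `tokenize` helper are assumptions for illustration, not rules taken from the question.

```python
import re

# Hypothetical token specification; names and operator set are assumptions.
TOKEN_SPEC = [
    ('NUMBER',  r'\d+(?:\.\d+)?'),   # integer or decimal such as 3 or 1.2
    ('ASSIGN',  r'<--'),             # listed before OP so '<' does not win first
    ('STRING',  r'"[^"\n]*"'),       # double-quoted string on a single line
    ('NAME',    r'[A-Za-z_]\w*'),    # identifiers and keywords
    ('NEWLINE', r'\n'),
    ('SKIP',    r'[ \t]+'),          # spaces and tabs are dropped in this sketch
    ('OP',      r'[-+*/<>=:().]'),   # single-character operators
    ('ERROR',   r'.'),               # any other character
]
TOKEN_RE = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (kind, text) pairs; raise on an unrecognized character."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        kind = m.lastgroup           # name of the alternative that matched
        if kind == 'SKIP':
            continue
        if kind == 'ERROR':
            raise SyntaxError(f'Unrecognized character {m.group()!r}')
        tokens.append((kind, m.group()))
    return tokens

print(tokenize('Hi <-- 1.2\n'))
# [('NAME', 'Hi'), ('ASSIGN', '<--'), ('NUMBER', '1.2'), ('NEWLINE', '\n')]
```

Because alternatives are tried left to right, the order of the list matters: `<--` has to appear before the single-character operators. This sketch discards indentation, so a language that needs '\t' tokens would have to emit them instead of skipping them.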

By modifying Elegant Odoo's code slightly I managed to find a solution.

    split_pattern = r'(<--|[+\-*()/%:{},\[\]<>=\n\t ]|(?<!\d)[.](?!\d))'
    word_list = re.split(split_pattern, self.source_code)
    word_list = [elem for elem in word_list if (elem != '' and elem != ' ')]
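A quick standalone check of this approach on short inputs, as a sketch. The function `split_source` and its `source_code` parameter are stand-ins for the `self.source_code` attribute, and the pattern is written as a raw string that is equivalent to the one above.

```python
import re

# Equivalent to the pattern above, written as a raw string;
# \n and \t are interpreted by the regex engine here.
split_pattern = r'(<--|[+\-*()/%:{},\[\]<>=\n\t ]|(?<!\d)[.](?!\d))'

def split_source(source_code):
    """Split on operators and whitespace, dropping empty strings and single spaces."""
    parts = re.split(split_pattern, source_code)
    return [elem for elem in parts if elem != '' and elem != ' ']

print(split_source('Hi <-- 1.2\n'))   # ['Hi', '<--', '1.2', '\n']
print(split_source('a < -b.c'))       # ['a', '<', '-', 'b', '.', 'c']
```

The lookarounds `(?<!\d)` and `(?!\d)` are what keep `1.2` in one piece while a dot between non-digits is still split off on its own.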

You could try this:

import re

str_code = """
DECLARE Number: INTEGER
DECLARE Hi: REAL
Hi <-- 1.2
INPUT Number

IF Number + Hi > 3
    THEN
        OUTPUT "Hello"
"""

# Split on <--, on whitespace, or on any single-character operator
split_pattern = re.compile(r'(<--|[+\-*()\s:>=<])')

print([x for x in re.split(split_pattern, str_code) if x not in [' ', '']])

output:

  ['\n', 'DECLARE', 'Number', ':', 'INTEGER', '\n', 'DECLARE', 'Hi', ':', 'REAL', '\n', 'Hi', '<--', '1.2', '\n', 'INPUT', 'Number', '\n', '\n', 'IF', 'Number', '+', 'Hi', '>', '3', '\n', 'THEN', '\n', 'OUTPUT', '"Hello"', '\n']
