
Branching at the parser level based on the content of a token

I'm working on a simple example parser/lexer for a tiny project, but I've run into a problem.

I'm parsing content along these lines:

Name SEP Gender SEP Birthday
Name SEP Gender SEP Birthday

… where SEP is exactly one (but not multiple!) of |, a comma, or whitespace.

Now, I didn't want to lock the field order in at the lexer level, so I'm trying to lex this with a very simple set of tokens:

%token <string> SEP
%token <string> VAL
%token NL

%token EOF

Now, it's dictated that I produce a parse error if, for instance, the gender field doesn't contain one of a small set of predetermined values, say {male,female,neither,unspecified}. I can wrap the parser and deal with this, but I'd really like to encode this requirement into the automaton for future expansion.

My first attempt, looking something like this, failed horribly:

doc:
   | EOF              { [] }
   | it = rev_records { it }
   ;

rev_records:
           | (* base-case: empty *) { [] }
           | rest = rev_records; record; NL  { record :: rest }
           | rest = rev_records; record; EOF { record :: rest }
           ;

record:
   last_name = name_field; SEP; first_name = name_field; SEP;
   gender = gender_field; SEP; favourite_colour = colour_field; SEP;
   birthday = date_field
   { {last_name; first_name; gender; favourite_colour; birthday} }

name_field: str = VAL { str }

gender_field:
            | VAL "male" { Person.Male }
            | VAL "female" { Person.Female }
            | VAL "neither" { Person.Neither }
            | VAL "unspecified" { Person.Unspecified }
            ;

Yeah, no dice. Obviously, my attempt at unstructured lexing is already going poorly.

What's the idiomatic way to parse something like this?

Parser generators such as Menhir and ocamlyacc operate on tokens, not on strings or characters. The transformation from characters to tokens happens at the lexer level. That's why you can't match on a token's string payload in a production rule: the grammar only sees that a VAL token arrived, not what it contains.
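To make that concrete, here is a minimal sketch of roughly the token type that the %token declarations in the question describe, written out by hand in plain OCaml. The helper kind_of_token is hypothetical and only for illustration; it shows that two VAL tokens with different payloads are the same terminal as far as the grammar is concerned.

```ocaml
(* Roughly the token type described by the question's %token declarations. *)
type token =
  | SEP of string
  | VAL of string
  | NL
  | EOF

(* Hypothetical helper: the "kind" of a token, which is all a production
   rule can distinguish. The string payload is ignored. *)
let kind_of_token = function
  | SEP _ -> "SEP"
  | VAL _ -> "VAL"
  | NL -> "NL"
  | EOF -> "EOF"

let () =
  (* Different strings, same terminal. *)
  assert (kind_of_token (VAL "male") = kind_of_token (VAL "banana"))
```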

You can, of course, perform any check in the semantic action and raise an exception, e.g.:

record:
   last_name = name_field; SEP; first_name = name_field; SEP;
   gender_val = VAL; SEP; favourite_colour = colour_field; SEP;
   birthday = date_field
   {
     let gender = match gender_val with
       | "male" -> Person.Male
       | "female" -> Person.Female
       | "neither" -> Person.Neither
       | "unspecified" -> Person.Unspecified
       | _ -> failwith "Parser error: invalid value in the gender field" in
     {last_name; first_name; gender; favourite_colour; birthday}
   }
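If you would rather keep the original record rule and its gender_field nonterminal, the same check can be factored into an ordinary function. The following is only a sketch: the Person module here is a stand-in for the one in the question (which isn't shown), and the names gender_of_string and Invalid_field are made up for illustration.

```ocaml
(* Stand-in for the question's Person module, which is not shown. *)
module Person = struct
  type gender = Male | Female | Neither | Unspecified
end

(* Hypothetical exception carrying the field name and the offending value,
   so a caller can tell a bad field apart from other failures. *)
exception Invalid_field of string * string

(* Map the raw text of a VAL token to a gender, or raise Invalid_field. *)
let gender_of_string (s : string) : Person.gender =
  match s with
  | "male" -> Person.Male
  | "female" -> Person.Female
  | "neither" -> Person.Neither
  | "unspecified" -> Person.Unspecified
  | other -> raise (Invalid_field ("gender", other))
```

With this function available to the grammar (in the %{ ... %} header or in a separate module), the nonterminal could be written as gender_field: s = VAL { gender_of_string s }, and a wrapper around the parser can catch Invalid_field to report the error.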

You can also tokenize the possible gender values, i.e. use regular expressions at the lexer level so that only valid gender strings produce a dedicated token, e.g.:

rule token = parse
| ("male" | "female" | "neither" | "unspecified") as s { GENDER s }
...

However, this is not recommended: it effectively turns male, female, etc. into keywords, so their occurrence anywhere else (in a name field, for instance) will break your grammar.
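To see the keyword problem without running ocamllex, here is a small hand-written stand-in for the lexer rule above. It is only a sketch: token_of_word is a hypothetical function that classifies one already-split word, and the sample record is invented.

```ocaml
type token = GENDER of string | VAL of string

(* Mimics the lexer rule: four fixed words become GENDER, all else VAL. *)
let token_of_word (w : string) : token =
  match w with
  | "male" | "female" | "neither" | "unspecified" -> GENDER w
  | _ -> VAL w

let () =
  (* Invented record whose first-name field happens to be "neither". *)
  let tokens = List.map token_of_word ["smith"; "neither"; "female"] in
  (* The name field arrives as GENDER, so a rule expecting VAL there
     would reject an otherwise valid record. *)
  assert (tokens = [VAL "smith"; GENDER "neither"; GENDER "female"])
```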
