
Parsing a code block with EBNF expression

I am using Coco/R to generate a Java-like scanner/parser, and I'm having trouble writing an EBNF expression that matches a code block.

I'm assuming a code block is surrounded by two well-known tokens, <& and &>. For example:

public method(int a, int b) <&  
various code  
&>  

Suppose I define a nonterminal symbol like this:

codeblock = "<&" {ANY} "&>"  

If the code between the two symbols contains a '<' character, the generated compiler cannot handle it and reports a syntax error.

Any hint?

Edit:

COMPILER JavaLike
CHARACTERS

nonZeroDigit  = "123456789".
digit         = '0' + nonZeroDigit .
letter        = 'A' .. 'Z' + 'a' .. 'z' + '_' + '$'.

TOKENS
ident = letter { letter | digit }.

PRODUCTIONS
JavaLike = {ClassDeclaration}.
ClassDeclaration = "class" ident ["extends" ident] "{" {VarDeclaration} {MethodDeclaration} "}".
MethodDeclaration = "public" Type ident "(" ParamList ")" CodeBlock.
CodeBlock = "<&" {ANY} "&>".

I have omitted some productions for the sake of simplicity.
This is my current implementation of the grammar. The main bug is that it fails if the code in the block contains either of the symbols '>' or '&'.

Nick, late to the party here ...

A number of ways to do this:

Define tokens for <& and &> so the lexer knows about them.

You may be able to use a COMMENTS directive, with the delimiters quoted as Coco/R expects:

COMMENTS FROM "<&" TO "&>"

Or hack NextToken() in your Scanner.frame file. Do something like this (pseudo-code):

if (Peek() == CODE_START)
{
     while (NextToken() != CODE_END)
     {
        // eat tokens
     }
}

Or you can override the Read() method in the Buffer and consume the block at the lowest level.
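As a rough illustration of that last idea (consuming the block at the character level, before any tokenizing happens), here is a minimal Python sketch. It is not Coco/R code: `split_code_blocks` and the "text"/"code" chunk labels are hypothetical names, and only the <& and &> delimiters come from the question.

```python
from typing import List, Tuple

CODE_START = "<&"  # delimiters taken from the question
CODE_END = "&>"


def split_code_blocks(source: str) -> List[Tuple[str, str]]:
    """Split source into ("text", ...) and ("code", ...) chunks.

    Everything between CODE_START and CODE_END is copied verbatim, so
    characters such as '<', '>' or '&' inside a block are never tokenized.
    """
    chunks: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(source):
        start = source.find(CODE_START, pos)
        if start == -1:
            # No further block: the rest is ordinary text.
            chunks.append(("text", source[pos:]))
            break
        if start > pos:
            chunks.append(("text", source[pos:start]))
        end = source.find(CODE_END, start + len(CODE_START))
        if end == -1:
            raise SyntaxError(f"unterminated code block at offset {start}")
        # Keep only the body, without the delimiters.
        chunks.append(("code", source[start + len(CODE_START):end]))
        pos = end + len(CODE_END)
    return chunks
```

The obvious limitation is that a block whose body itself contains the two characters &> ends early, so this only works if that sequence cannot occur in the embedded code.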

HTH

You can expand the ANY term to include <& , &> , and another nonterminal (call it ANY_WITHIN_BLOCK say).

Then you just use

ANY = "<&" | {ANY_WITHIN_BLOCK} | "&>"
codeblock = "<&" {ANY_WITHIN_BLOCK} "&>"

And then the meaning of {ANY} is unchanged if you really need it later.

Okay, I didn't know anything about CocoR and gave you a useless answer, so let's try again.

As I started to say later in the comments, I feel that the real issue is that your grammar might be too loose and not sufficiently well specified.

When I wrote the CFG for the one language I've tried to create, I ended up using a sort of "meet-in-the-middle" approach: I wrote the top-level structure AND the immediate low-level combinations of tokens first, and then worked to make them meet in the mid-level (at about the level of conditionals and control flow, I guess).

You said this language is a bit like Java, so let me show you the first lines I would write as a first draft of its grammar (in pseudocode, sorry; it's really yacc/bison-like notation, and I'm using your brackets instead of Java's):

/* High-level stuff */
program: classes
classes: main-class inner-classes
inner-classes: inner-classes inner-class | /* empty */
main-class: class-modifier "class" identifier class-block
inner-class: "class" identifier class-block
class-block: "<&" class-decls "&>"
class-decls: field-decl | method
method: method-signature method-block
method-block: "<&" statements "&>"
statements: statements statement | /* empty */
class-modifier: "public" | "private"
identifier: /* well, you know */

And at the same time as you do all that, figure out your immediate token combinations, like for example defining "number" as a float or an int and then creating rules for adding/subtracting/etc. them.

I don't know what your approach is so far, but you definitely want to make sure you carefully specify everything and use new rules when you want a specific structure. Don't get ridiculous with creating one-to-one rules, but never be afraid to create a new rule if it helps you organize your thoughts better.
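To make the idea concrete, here is a small Python sketch of the method-block rule from the draft grammar, with the block contents described by real rules instead of a catch-all. The token set and the statement, expression and operand rules are assumptions made up for this illustration; they are not part of the original grammar and not Coco/R output.

```python
import re
from typing import List, Tuple

# Hypothetical token set. The block delimiters are listed before the
# single-character operators so that "<&" is preferred over a lone "<".
TOKEN_RE = re.compile(r"""
    (?P<BLOCK_START><&)
  | (?P<BLOCK_END>&>)
  | (?P<NUMBER>\d+(?:\.\d+)?)
  | (?P<IDENT>[A-Za-z_$][A-Za-z_$0-9]*)
  | (?P<OP>[-+*/<>=&;(){},])
  | (?P<WS>\s+)
""", re.VERBOSE)


def tokenize(source: str) -> List[Tuple[str, str]]:
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected {source[pos]!r} at offset {pos}")
        if match.lastgroup != "WS":  # whitespace is skipped
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens: List[Tuple[str, str]]) -> None:
        self.tokens, self.pos = tokens, 0

    def peek(self) -> Tuple[str, str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("EOF", "")

    def expect(self, kind: str, text: str = None) -> Tuple[str, str]:
        tok = self.peek()
        if tok[0] != kind or (text is not None and tok[1] != text):
            raise SyntaxError(f"expected {text or kind}, found {tok[1]!r}")
        self.pos += 1
        return tok

    def method_block(self) -> list:
        # method-block: "<&" statements "&>"
        self.expect("BLOCK_START")
        statements = []
        while self.peek()[0] != "BLOCK_END":
            statements.append(self.statement())
        self.expect("BLOCK_END")
        return statements

    def statement(self) -> tuple:
        # Assumed rule: statement: identifier "=" expression ";"
        target = self.expect("IDENT")[1]
        self.expect("OP", "=")
        value = self.expression()
        self.expect("OP", ";")
        return ("assign", target, value)

    def expression(self):
        # Assumed rule: expression: operand { ("+" | "-" | "<" | ">") operand }
        node = self.operand()
        while self.peek() in (("OP", "+"), ("OP", "-"), ("OP", "<"), ("OP", ">")):
            op = self.expect("OP")[1]
            node = (op, node, self.operand())
        return node

    def operand(self) -> str:
        tok = self.peek()
        if tok[0] not in ("NUMBER", "IDENT"):
            raise SyntaxError(f"expected operand, found {tok[1]!r}")
        self.pos += 1
        return tok[1]
```

For example, `Parser(tokenize("<& x = a < 3; &>")).method_block()` accepts the '<' inside the block because '<' is a token that the expression rule knows about. One caveat of this token ordering: input such as `a<&b` would be read as a block start, so the delimiters still need to be chosen so that they cannot clash with ordinary operators.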
