How to capture anything in an ANTLR grammar?

I have a grammar that looks like a markup/markdown language. We use it to produce our textbooks.

It is something like:

    [chapter Introduction]

    [section First program]

    Java is pretty cool, **we love it**, let's learn.

    Use the ::javacc:: to compile stuff.

    [title C# is also cool]

    bla bla 

    [code]

    some java code in here

    [/code]

We have this sort of [tag xxx]content[/tag] markup language. I wrote a grammar for it, but it doesn't work in all cases. My main question is how to capture the content inside [code] or even [title], which can be anything.

To capture [section blabla], I tried the following:

    secao      : '[section ' secao_nome ']';
    secao_nome : (~']'+?);

I used (~']'+?) to capture everything up to the closing bracket. That was my main idea: write many expressions like that, one for each tag I have, and make each one stop at its closing delimiter. For example, I tried (~'::'+?) to capture the content of the italic markup (which ends with ::).
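
As a point of reference for the intended behavior (this is not ANTLR syntax), the "everything up to the closing delimiter" idea can be sketched with regular expressions in Python. The pattern names and the sample string are illustrative assumptions, not part of the grammar.

```python
import re

# Illustrative patterns: each captures the shortest run of text
# up to the tag's closing delimiter.
SECTION = re.compile(r"\[section ([^\]]+)\]")  # stops at the first ']'
ITALIC = re.compile(r"::(.+?)::")              # lazy: stops at the next '::'
BOLD = re.compile(r"\*\*(.+?)\*\*")            # lazy: stops at the next '**'

# Hypothetical one-line sample built from the markup shown above.
sample = "[section First program] Java is pretty cool, **we love it**. Use the ::javacc:: tool."

section_name = SECTION.search(sample).group(1)
italic_text = ITALIC.search(sample).group(1)
bold_text = BOLD.search(sample).group(1)

print(section_name, "|", italic_text, "|", bold_text)
```

A single-character delimiter such as ] can be excluded with a negated character class, while a two-character delimiter such as :: needs a lazy quantifier (or a lookahead), because a character class only ever tests one character at a time.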

I also tried to have a generic token for the content inside the tags. However, that token has to exclude ::, **, and all the other symbols that mean something depending on the context. So my expression RAW : (~[\n\[\]'**''::''__''%%'' '0-9\"] | ':')+; doesn't work.

You can see my full grammar here. Sorry that the names are in Portuguese:

    grammar Tubaina;

    @header {
        package br.com.caelum.tubaina.antlr;
    }

    afc                 : capitulo conteudos+;

    capitulo            : '[chapter ' capitulo_nome ']';
    capitulo_nome       : (~']'+?)*;

    conteudos           : enter* conteudo+ enter*;
    conteudo            : (secao | texto | subsecao | label | box | codigo | lista | imagem | exercicios | index | tabela | quote | todo | note);

    secao               : '[section ' secao_nome ']';
    secao_nome          : (~'['+?);

    quote               : '[quote ' quote_texto '--' quote_autor ']';
    quote_texto         : (~'--'+?);
    quote_autor         : (~']'+?);

    tabela              : '[table "' tabela_nome '"]' tabela_linhas+;
    tabela_nome         : (~'"'+?);
    tabela_linhas       : '[row]' tabela_colunas+ '[/row]';
    tabela_colunas      : '[col]' tabela_conteudo '[/col]';
    tabela_conteudo     : conteudo;

    index               : '[index ' index_nome ']';
    index_nome          : (~']'+?);

    exercicios          : '[exercise]' questoes '[/exercise]';
    questoes            : (enter* questao_def enter*)+;
    questao_def         : '[question]' enter* questao resposta_def? enter* '[/question]';
    questao             : (conteudo | enter)+; 
    resposta_def        : enter* '[answer]' resposta '[/answer]';
    resposta            : (texto | enter)+; 

    imagem                  : '[img ' espaco* imagem_path espaco* imagem_tamanho_def? espaco* (imagem_comentario_def? | ']');
    imagem_path             : (~' '+?);
    imagem_tamanho_def      : 'w=' imagem_tamanho '%';
    imagem_tamanho          : NUMERO;
    imagem_comentario_def   : '"' imagem_comentario '"]';
    imagem_comentario       : (~'"'+?);

    lista               : lista_numerada | lista_nao_numerada;
    lista_numerada      : '[list ' lista_tipo ']' item* '[/list]';
    lista_tipo          : 'number' | 'roman' | 'letter';
    lista_nao_numerada  : '[list]' item* '[/list]';
    item                : enter* '*' texto* enter*;

    todo                : todo_comando todo_comentario ']';
    todo_comando        : '[todo ' | '[TODO ';
    todo_comentario     : (~']'+?);

    note                : '[note]' note_conteudo+ '[/note]';
    note_conteudo       : (enter* texto enter*);

    box                 : '[box ' box_titulo ']' box_conteudo+ '[/box]';
    box_conteudo        : (enter* conteudos+ enter*);
    box_titulo          :  (~']'+?);

    subsecao            : '[title ' subsecao_nome ']';
    subsecao_nome       : (~']'+?);

    label               : '[label ' label_nome ']';
    label_nome          : (~']'+?);

    codigo                  : codigo_com_linguagem | codigo_sem_linguagem | codigo_do_arquivo;
    codigo_do_arquivo       : '[code ' linguagem 'filename=' codigo_arquivo_path '[/code]';
    codigo_arquivo_path     : (~' '+?);
    codigo_raw              : (~'[/code]'+?);
    linguagem               : (~' '+?);
    codigo_sem_linguagem    : '[code]' codigo_raw '[/code]';
    codigo_com_linguagem    : '[code ' linguagem codigo_fechado codigo_raw '[/code]';
    codigo_fechado          : ' #]' | ']';

    texto               : paragrafo | negrito | italico | underline | inline;
    paragrafo           : linha enter?;
    linha               : (~'\n'+?);
    negrito             : '**' linha '**';
    italico             : '::' linha '::';
    underline           : '__' linha '__';
    inline              : '%%' linha '%%';

    enter                       : N | TAB;
    espaco                      : ESPACO;

    N                   : ['\n'];
    TAB                 : '\t';
    ESPACO : ' ';
    NUMERO : [0-9]+;

    WS                  : (' ' | '\t') -> skip;

Also, my attempt with the generic regex is here: https://github.com/mauricioaniche/tubaina-antlr-grammar/blob/f381ad0e3d1bd458922165c7166c7f1c55cea6c2/src/br/com/caelum/tubaina/antlr/Tubaina.g4

My question is: how can I write a grammar for a language like this, in which I have tags that can contain any content? Any ideas?

Thanks in advance!

I'm not sure about ANTLR, so I'm posting this answer in case it helps you with the regex idea.

You could use a regex like this:

\[code\]([\s\S]+)\[/code\]|\[title (.+)\]

Match information

MATCH 1
2.  [165-180]   `C# is also cool`
MATCH 2
1.  [207-241]   `

    some java code in here

    `
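
The match listing above can be reproduced with a short Python sketch. The sample text is an abbreviated, unindented version of the markup from the question, and the tuple labels "title" and "code" are illustrative names only.

```python
import re

# The compound pattern from the answer: group 1 holds the [code] content,
# group 2 holds the [title ...] name.
PATTERN = re.compile(r"\[code\]([\s\S]+)\[/code\]|\[title (.+)\]")

# Abbreviated sample (an assumption; the real file has more content).
sample = """[chapter Introduction]

[section First program]

Use the ::javacc:: to compile stuff.

[title C# is also cool]

bla bla

[code]
some java code in here
[/code]
"""

results = []
for match in PATTERN.finditer(sample):
    if match.group(1) is not None:
        # First alternative matched: a [code]...[/code] block.
        results.append(("code", match.group(1).strip()))
    else:
        # Second alternative matched: a [title ...] tag.
        results.append(("title", match.group(2)))

for kind, content in results:
    print(kind, "->", content)
```

Only one of the two groups is set for any given match, which is why the listing shows group 2 for the title and group 1 for the code block.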

I've combined both regexes into a single one using OR (|) to show you the idea. If you are able to use two separate regexes, you can use the following:

\[code\]([\s\S]+)\[/code\]   <-- to capture the [code]XX[/code] content
\[title (.+)\]               <-- to capture the [title XX] content
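
One caveat: [\s\S]+ is greedy, so when a document contains more than one [code] block it matches from the first [code] to the last [/code]. The sketch below contrasts that with lazy variants. The lazy patterns and the sample text are assumptions added for illustration, not part of the original answer.

```python
import re

# Greedy form, as written in the answer.
CODE_GREEDY = re.compile(r"\[code\]([\s\S]+)\[/code\]")

# Lazy variants (illustrative assumptions): stop at the first closing delimiter.
CODE_LAZY = re.compile(r"\[code\]([\s\S]+?)\[/code\]")
TITLE = re.compile(r"\[title ([^\]]+)\]")

# Hypothetical sample with two titles and two code blocks.
sample = "[title One]\n[code]first[/code]\ntext\n[title Two]\n[code]second[/code]\n"

greedy_blocks = CODE_GREEDY.findall(sample)  # one oversized match
lazy_blocks = CODE_LAZY.findall(sample)      # one match per block
titles = TITLE.findall(sample)

print(len(greedy_blocks), lazy_blocks, titles)
```

With a single [code] block, as in the sample from the question, the greedy and lazy forms return the same content.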
