How to capture anything in an ANTLR grammar?
I have a grammar that looks like a markup/markdown language. We use it to produce our textbooks.
It is something like:
[chapter Introduction] [section First program] Java is pretty cool, **we love it**, let's learn. Use the ::javacc:: to compile stuff. [title C# is also cool] bla bla [code] some java code in here [/code]
We have this sort of [tag xxx]content[/tag] markup language. I wrote the grammar for that, but it doesn't work for all cases. My main question is how to capture the content inside [code] or even [title], which can be anything.
To capture [section blabla], I tried the following:
secao : '[section ' secao_nome ']'; secao_nome : (~']'+?);
I tried (~']'+?) to capture everything but the closing tag. That was my main idea: write many regexes like that, one for each tag that I have, and make them ignore the "close tag". For example, I tried to do (~'::'+?) to capture the content of the italic (which ends with ::).
I also tried to have a generic token for the inside content. However, I need to ignore ::, **, and all the symbols that actually mean something depending on the context. So, my expression RAW : (~[\\n\\[\\]'**''::''__''%%'' '0-9\\"] | ':')+; doesn't work.
You can see my full grammar here. Sorry that the names are in Portuguese:
grammar Tubaina;

@header { package br.com.caelum.tubaina.antlr; }

afc : capitulo conteudos+;
capitulo : '[chapter ' capitulo_nome ']';
capitulo_nome : (~']'+?)*;
conteudos : enter* conteudo+ enter*;
conteudo : (secao | texto | subsecao | label | box | codigo | lista | imagem | exercicios | index | tabela | quote | todo | note);

secao : '[section ' secao_nome ']';
secao_nome : (~'['+?);

quote : '[quote ' quote_texto '--' quote_autor ']';
quote_texto : (~'--'+?);
quote_autor : (~']'+?);

tabela : '[table "' tabela_nome '"]' tabela_linhas+;
tabela_nome : (~'"'+?);
tabela_linhas : '[row]' tabela_colunas+ '[/row]';
tabela_colunas : '[col]' tabela_conteudo '[/col]';
tabela_conteudo : conteudo;

index : '[index ' index_nome ']';
index_nome : (~']'+?);

exercicios : '[exercise]' questoes '[/exercise]';
questoes : (enter* questao_def enter*)+;
questao_def : '[question]' enter* questao resposta_def? enter* '[/question]';
questao : (conteudo | enter)+;
resposta_def : enter* '[answer]' resposta '[/answer]';
resposta : (texto | enter)+;

imagem : '[img ' espaco* imagem_path espaco* imagem_tamanho_def? espaco* (imagem_comentario_def? | ']');
imagem_path : (~' '+?);
imagem_tamanho_def : 'w=' imagem_tamanho '%';
imagem_tamanho : NUMERO;
imagem_comentario_def : '"' imagem_comentario '"]';
imagem_comentario : (~'"'+?);

lista : lista_numerada | lista_nao_numerada;
lista_numerada : '[list ' lista_tipo ']' item* '[/list]';
lista_tipo : 'number' | 'roman' | 'letter';
lista_nao_numerada : '[list]' item* '[/list]';
item : enter* '*' texto* enter*;

todo : todo_comando todo_comentario ']';
todo_comando : '[todo ' | '[TODO ';
todo_comentario : (~']'+?);

note : '[note]' note_conteudo+ '[/note]';
note_conteudo : (enter* texto enter*);

box : '[box ' box_titulo ']' box_conteudo+ '[/box]';
box_conteudo : (enter* conteudos+ enter*);
box_titulo : (~']'+?);

subsecao : '[title ' subsecao_nome ']';
subsecao_nome : (~']'+?);

label : '[label ' label_nome ']';
label_nome : (~']'+?);

codigo : codigo_com_linguagem | codigo_sem_linguagem | codigo_do_arquivo;
codigo_do_arquivo : '[code ' linguagem 'filename=' codigo_arquivo_path '[/code]';
codigo_arquivo_path : (~' '+?);
codigo_raw : (~'[/code]'+?);
linguagem : (~' '+?);
codigo_sem_linguagem : '[code]' codigo_raw '[/code]';
codigo_com_linguagem : '[code ' linguagem codigo_fechado codigo_raw '[/code]';
codigo_fechado : ' #]' | ']';

texto : paragrafo | negrito | italico | underline | inline;
paragrafo : linha enter?;
linha : (~'\n'+?);
negrito : '**' linha '**';
italico : '::' linha '::';
underline : '__' linha '__';
inline : '%%' linha '%%';

enter : N | TAB;
espaco : ESPACO;

N : ['\n'];
TAB : '\t';
ESPACO : ' ';
NUMERO : [0-9]+;
WS : (' ' | '\t') -> skip;
Also, my attempt with the generic regex is here: https://github.com/mauricioaniche/tubaina-antlr-grammar/blob/f381ad0e3d1bd458922165c7166c7f1c55cea6c2/src/br/com/caelum/tubaina/antlr/Tubaina.g4
My question is: how can I write a grammar for a language like that, in which I have tags and any content inside them? Any ideas?
Thanks in advance!
I'm not sure about ANTLR, so I'm posting this answer in case it helps you with the regex idea.
You could use a regex like this:
\[code\]([\s\S]+)\[/code\]|\[title (.+)\]
Match information
MATCH 1
2. [165-180] `C# is also cool`
MATCH 2
1. [207-241] `
some java code in here
`
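As a rough sketch of how the compound pattern could be applied, here it is run through Python's `re` module. Python is only an illustration choice, and the `sample` text and variable names are assumptions adapted from the example in the question. Group 1 is set when a `[code]` block matched, group 2 when a `[title ...]` tag matched.

```python
import re

# Compound pattern from above: group 1 = [code] body, group 2 = [title ...] text.
PATTERN = re.compile(r"\[code\]([\s\S]+)\[/code\]|\[title (.+)\]")

# Hypothetical sample, adapted from the example in the question.
sample = (
    "[chapter Introduction]\n"
    "[section First program]\n"
    "Java is pretty cool, **we love it**, let's learn.\n"
    "[title C# is also cool]\n"
    "bla bla\n"
    "[code]\nsome java code in here\n[/code]\n"
)

matches = []
for m in PATTERN.finditer(sample):
    if m.group(1) is not None:
        # The first alternative matched: a [code]...[/code] block.
        matches.append(("code", m.group(1)))
    else:
        # The second alternative matched: a [title ...] tag.
        matches.append(("title", m.group(2)))

for kind, text in matches:
    print(kind, repr(text))
```

Checking which group is not `None` is what tells the two alternatives apart, since only one side of the OR can match at a time.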
I've put both regexes into a compound one using OR to show you the idea. If you are able to use two regexes, then you can use the following:
\[code\]([\s\S]+)\[/code\] <-- to capture the [code]XX[/code] content
\[title (.+)\] <-- to capture the [title XX] content
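Note that `[\s\S]+` is greedy, so in a document with more than one `[code]` block it runs from the first `[code]` to the last `[/code]`. One possible way around that is a lazy quantifier (`+?`) for the code body and a negated class for the title. A minimal sketch in Python, where the helper names and the `doc` string are hypothetical and the `[code lang]` variant is not handled:

```python
import re

# Lazy variant: stop at the first [/code] rather than the last one.
CODE_RE = re.compile(r"\[code\]([\s\S]+?)\[/code\]")
# Negated class: a title runs up to the first closing bracket.
TITLE_RE = re.compile(r"\[title ([^\]]+)\]")


def extract_code_blocks(text):
    """Return the raw content of every [code]...[/code] block."""
    return CODE_RE.findall(text)


def extract_titles(text):
    """Return the text of every [title ...] tag."""
    return TITLE_RE.findall(text)


# Hypothetical input with two code blocks and two titles.
doc = "[title One] [code]a = 1[/code] text [title Two] [code]b = 2[/code]"

code_blocks = extract_code_blocks(doc)
titles = extract_titles(doc)

# For comparison: the greedy pattern swallows everything between the
# first [code] and the last [/code] as a single match.
greedy = re.findall(r"\[code\]([\s\S]+)\[/code\]", doc)

print(code_blocks)
print(titles)
print(greedy)
```

With a single `[code]` block per document the greedy and lazy forms behave the same; the difference only shows up once there are several blocks.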