ANTLR4: Parsing an email header, lookahead not working correctly, Python target

Question

I am trying to parse this portion of an email header:

Received: from server.mymailhost.com (mail.mymailhost.com [126.43.75.123]) by pilot01.cl.msu.edu (8.10.2/8.10.2) with ESMTP id NAA23597;Fri, 12 Jul 2002 16:11:20 -0400 (EDT)

I want the lexer to tokenize it into these pieces:

Received: 

from server.mymailhost.com (mail.mymailhost.com [126.43.75.123]) 

by pilot01.cl.msu.edu (8.10.2/8.10.2) 

with ESMTP 

id NAA23597

;

Fri, 12 Jul 2002 16:11:20 -0400 (EDT)

<EOF>
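
The intended split can be cross-checked outside ANTLR. The following is a minimal sketch in plain Python, assuming each piece runs up to, but does not include, the next keyword (by, with, id) or the semicolon. The regular expression and its group names are illustrative only and are not part of the grammar.

```python
import re

# The header from the question, written as one string.
header = (
    "Received: from server.mymailhost.com (mail.mymailhost.com [126.43.75.123]) "
    "by pilot01.cl.msu.edu (8.10.2/8.10.2) with ESMTP id NAA23597;"
    "Fri, 12 Jul 2002 16:11:20 -0400 (EDT)"
)

# Each non-greedy group stops at a zero-width lookahead for the next keyword,
# which mirrors the "match lazily, then check what comes next" idea.
pattern = re.compile(
    r"(?P<Received>Received: )"
    r"(?P<FromText>from .+?)(?=by)"
    r"(?P<ByText>by .+?)(?=with)"
    r"(?P<WithText>with .+?)(?=id)"
    r"(?P<IdText>id .+?)(?=;)"
    r"(?P<SemiColon>;)"
    r"(?P<Date>.+)$"
)

match = pattern.match(header)
tokens = match.groupdict()

for name, text in tokens.items():
    print(f"{name}: {text!r}")
```

Note that the lazy groups keep the trailing space before the next keyword, so the with piece comes out as 'with ESMTP ' rather than 'with ESMTP'.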

Here's my parser grammar:

parser grammar MyParser;                

options { tokenVocab=MyLexer; }         

received : Received fromToken byToken withToken idToken SemiColon date EOF ;

fromToken : FromText ;

byToken: ByText ;

withToken : WithText ;

idToken : IdText ;

date : DateContents+ ;

My lexer grammar is further below. This is the error that I get when I run ANTLR:

token recognition error at: 'from server.mymailhost.com (mail.mymailhost.com [126.43.75.123]) by pilot01.cl.msu.edu (8.10.2/8.10.2) with ESMTP id NAA23597;Fri, 12 Jul 2002 16:11:20 -0400 (EDT)'

mismatched input '<EOF>' expecting FromText

Apparently the lexer successfully gets the first token (Received) but then fails to get the next token (FromText). Note that I am using lookahead in the lexer grammar; am I using it correctly? Any thoughts on what the problem is?

lexer grammar MyLexer;                  

Received : 'Received: ' ;
SemiColon : ';' ;

FromText : 'from ' .+? 
      {  
        (self.input.LA(1) == 'b') and (self.input.LA(2) == 'y')
      }? ;

ByText : 'by '.+? 
      {
        (self.input.LA(1) == 'w') and (self.input.LA(2) == 'i') and (self.input.LA(3) == 't') and (self.input.LA(4) == 'h')
      }? ;

WithText : 'with ' .+? 
      {
        (self.input.LA(1) == 'i') and (self.input.LA(2) == 'd')
      }? ;

IdText : 'id ' .+? 
      {
        (self.input.LA(1) == ';')
      }? ;

DateContents : ('Mon' | 'Tue' | 'Wed' | 'Thu' | 'Fri' | 'Sat' | 'Sun') (Letter | Number | Special)+ ;

fragment Letter :  'A'..'Z' | 'a'..'z' ;

fragment Number : '0'..'9' ;

fragment Special : ' ' | '_' | '-' | '.' | ',' | '~' | ':' | '+' | '$' | '=' | '(' | ')' | '[' | ']' | '/' ;

Whitespace : [\t\r\n]+ -> skip ;

Answer 1

After much effort I figured out the answer. Here is the working lexer:

lexer grammar MyLexer;                  

Received : 'Received: ' ;
SemiColon : ';' ;

FromText : 'from ' .+? 
      {(self._input.LA(1) == ord('b')) and (self._input.LA(2) == ord('y'))}?
      ;

ByText : 'by '.+? 
      {(self._input.LA(1) == ord('w')) and (self._input.LA(2) == ord('i')) and (self._input.LA(3) == ord('t')) and (self._input.LA(4) == ord('h'))}? 
      ;

WithText : 'with ' .+? 
      {(self._input.LA(1) == ord('i')) and (self._input.LA(2) == ord('d'))}? 
      ;

IdText : 'id ' .+? 
      {(self._input.LA(1) == ord(';'))}? 
      ;

DateContents : ('Mon' | 'Tue' | 'Wed' | 'Thu' | 'Fri' | 'Sat' | 'Sun') (Letter | Number | Special)+ ;

fragment Letter :  'A'..'Z' | 'a'..'z' ;

fragment Number : '0'..'9' ;

fragment Special : ' ' | '_' | '-' | '.' | ',' | '~' | ':' | '+' | '$' | '=' | '(' | ')' | '[' | ']' | '/' ;

Whitespace : [\t\r\n]+ -> skip ;
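
Compared with the grammar in the question, the predicates now read from self._input instead of self.input and compare the lookahead against ord(...) values. The second change matters because, as far as I can tell, LA(k) in the Python runtime returns an integer code point rather than a one-character string, and in Python an int never compares equal to a str. The sketch below illustrates this with a small stand-in class; FakeCharStream is hypothetical and is not part of the ANTLR runtime.

```python
class FakeCharStream:
    """Hypothetical stand-in for a character stream whose LA(k) returns code points."""

    EOF = -1

    def __init__(self, text, index=0):
        # Store the text as integers, one code point per character.
        self.data = [ord(c) for c in text]
        self.index = index

    def LA(self, k):
        # k is 1-based: LA(1) is the next character that has not been consumed yet.
        pos = self.index + k - 1
        return self.data[pos] if 0 <= pos < len(self.data) else self.EOF


stream = FakeCharStream("by pilot01.cl.msu.edu")

# Style used in the question: int compared with str, which is always False.
broken = (stream.LA(1) == 'b') and (stream.LA(2) == 'y')

# Style used in the working lexer: int compared with int.
fixed = (stream.LA(1) == ord('b')) and (stream.LA(2) == ord('y'))

print(broken, fixed)  # prints: False True
```

With the string comparison every predicate is false, so none of the lookahead rules can match, which is consistent with the token recognition error shown in the question.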

Answer 1, accepted, 2015-08-31 21:28:04
