
flex and bison: parse string without quotes

I'm working on an MGF file parser (syntax: http://www.matrixscience.com/help/data_file_help.html) using flex + bison in C++.

I've written the lexer (lex) and the parser (yacc), but I have a problem I can't solve when I try to parse strings.

Important: there is no ' or " around the string.

Here is an example of input:

CHARGE=1+, 2+ and 3+
#some comments

BEGIN IONS
TITLE= Cmpd 1, +MSn(417.2108), 10.0 min  //line 20
PEPMASS=417.21083   35173
CHARGE=3+
123.79550   20  
285.16455   56  
302.14335   146 1+
[other datas ...]
END IONS

BEGIN IONS
[an other one ... ]

Here is the (minimal) lexer. MGF_TOKEN_DEBUG is just a macro that prints a line:

#define MGF_TOKEN_DEBUG(val) std::cout<<"token: "<<val<<std::endl

\n {
    MGF_TOKEN_DEBUG("T_EOL");
    return token::T_EOL;
}

^[#;!/][^\n]* {
    MGF_TOKEN_DEBUG("T_COMMENT");
    return token::T_COMMENT;
}

[[:space:]] {}

/** values **/
[0-9]+ {
    MGF_TOKEN_DEBUG("V_INTEGER"<<" (="<<yytext<<")");
    return token::V_INTEGER;
}

[0-9]+"."[0-9]* {
   MGF_TOKEN_DEBUG("V_DOUBLE"<<" (="<<yytext<<")");
   return token::V_DOUBLE;
}

[0-9]+("."[0-9]+)?[eE][+-][0-9]+ {
    MGF_TOKEN_DEBUG("V_DOUBLE"<<" (="<<yytext<<")");
    return token::V_DOUBLE;
}

"+" {
    MGF_TOKEN_DEBUG("T_PLUS");
    return token::T_PLUS;
}


"=" {
    MGF_TOKEN_DEBUG("T_EQUALS");
    return token::T_EQUALS;
}

"," {
    MGF_TOKEN_DEBUG("T_COMA");
    return token::T_COMA;
}

"and" {
    MGF_TOKEN_DEBUG("T_AND");
    return token::T_AND;
}
/*** keywords */
^"CHARGE" {
    MGF_TOKEN_DEBUG("K_CHARGE");
    return token::K_CHARGE;
}

^"TITLE" {
    MGF_TOKEN_DEBUG("K_TITLE");
    return token::K_TITLE;
}
[ others keywords ...]

/**** string : problem here **/
[A-Za-z]([:;,()A-Za-z0-9_.-]|[[:space]])* {
    MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")");
    return token::V_STRING;
}

And the (minimized) parser:

start : headerparams blocks T_END;

headerparams : /* empty */| headerparams headerparam;

headerparam : K_CHARGE T_EQUALS charge_list T_EOL | [others ...];

blocks : /* empty */ | blocks block;

block : T_BEGIN_IONS T_EOL blockparams ions T_END_IONS T_EOL| T_BEGIN_IONS T_EOL blockparams T_END_IONS T_EOL;

blockparam  : K_CHARGE T_EQUALS charge T_EOL | K_TITLE T_EQUALS V_STRING T_EOL | [others...];

ion : number number  T_EOL| number number charge T_EOL;

ions : ions ion| ion;

number : V_INTEGER | V_DOUBLE;

charge : V_INTEGER T_PLUS | V_INTEGER T_MINUS;

charge_list : charge| charge_list T_COMA charge | charge_list T_AND charge;

My problem is that I get the following tokens:

[...]
[line 20]
token: K_TITLE
token: T_EQUALS
token: v_STRING (= Cmpd)
token: V_INTEGER (= 1)
Error line 20: syntax error, unexpected integer, expecting end of line

I would like to have:

[...]
[line 20]
token: K_TITLE
token: T_EQUALS
token: v_STRING (Cmpd 1, +MSn (417.2108), 10.0 min)
token: T_EOL

If someone can help me ...


Edit #1: I've "solved" the problem by concatenating tokens:

lex:

[A-Za-z][^\n[:space:]+=,-]* {
    /* '-' goes last in the class so it is a literal, not the range "+-=" */
    MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")");
    return token::V_STRING;
}

yacc:

   string_st : V_STRING
      | string_st V_STRING
      | string_st number
      | string_st T_COMA
      | string_st T_PLUS
      | string_st T_MINUS
      ;

blockparam  : K_CHARGE T_EQUALS charge T_EOL | K_TITLE T_EQUALS string_st T_EOL | [others...];

If your string always starts with the text TITLE and ends with a newline character (\n), I would suggest using start conditions:

%x IN_TITLE

"TITLE"                 { BEGIN(IN_TITLE); return token::K_TITLE; }
<IN_TITLE>"="           { return token::T_EQUALS; }
<IN_TITLE>\n            { BEGIN(INITIAL); return token::T_EOL; }
<IN_TITLE>[^=\n][^\n]*  { /* cannot start with '=', so the rule above still wins there */
                          MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")"); return token::V_STRING; }

%x IN_TITLE defines the exclusive IN_TITLE state, and matching the text TITLE enters it. Once the state is active, \n sends the lexer back to the initial state (INITIAL is predefined), and all other characters on the line are consumed into V_STRING without any further processing.
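
To make the intended token stream concrete, here is a small hand-written C++ stand-in for a scanner with one start condition. It is not what flex generates; the Token struct, the scan_title_line function and the choice to emit "=" as its own token inside the state are assumptions made for this sketch, and every rule other than TITLE is omitted.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token record; the type names mirror the question's tokens.
struct Token {
    std::string type;
    std::string text;
};

// Hand-rolled stand-in for a scanner with one exclusive start condition.
// State::InTitle plays the role of IN_TITLE, State::Initial of INITIAL.
std::vector<Token> scan_title_line(const std::string& in) {
    enum class State { Initial, InTitle };
    State state = State::Initial;
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (state == State::Initial) {
            if (in.compare(i, 5, "TITLE") == 0) {
                out.push_back({"K_TITLE", "TITLE"});
                state = State::InTitle;          // like BEGIN(IN_TITLE)
                i += 5;
            } else {
                ++i;                             // other rules omitted here
            }
        } else if (in[i] == '\n') {
            out.push_back({"T_EOL", "\n"});
            state = State::Initial;              // like BEGIN(INITIAL)
            ++i;
        } else if (in[i] == '=') {
            out.push_back({"T_EQUALS", "="});
            ++i;
        } else {
            // Everything up to (not including) the newline is one string.
            std::size_t end = in.find('\n', i);
            if (end == std::string::npos) end = in.size();
            out.push_back({"V_STRING", in.substr(i, end - i)});
            i = end;
        }
    }
    return out;
}
```

Calling scan_title_line on the question's TITLE line yields K_TITLE, T_EQUALS, one V_STRING holding the rest of the line (leading space included), and T_EOL, which is the stream the grammar rule for TITLE expects.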

Your basic problem is a simple typo:

[A-Za-z]([:;,()A-Za-z0-9_.-]|[[:space]])*

should be:

[A-Za-z]([:;,()A-Za-z0-9_.-]|[[:space:]])*
                                     ^

You don't actually need the | operator. The following is perfectly legal (but probably not what you want either; see below):

[A-Za-z][[:space:]:;,()A-Za-z0-9_.-]*

Once you fix that, you'll find that you have another problem: your keywords (TITLE, for example) will be lexed as STRING, because flex prefers the longest match and the STRING pattern can match more text than the keyword. (In fact, since [:space:] includes \n, the STRING pattern will probably run across line ends and extend far into the input. You probably wanted [:blank:].)
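
The difference between the two character classes can be checked outside flex. The sketch below hand-codes the corrected STRING pattern in C++ and reports how many characters it would match; string_match_length is a hypothetical helper written for this illustration, and it relies on std::isspace and std::isblank in the default "C" locale as stand-ins for [:space:] and [:blank:].

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Length of the prefix of `in` matched by a hand-coded version of
//   [A-Za-z][<ws>:;,()A-Za-z0-9_.-]*
// where <ws> is [:space:] if use_space_class is true, else [:blank:].
std::size_t string_match_length(const std::string& in, bool use_space_class) {
    if (in.empty() || !std::isalpha(static_cast<unsigned char>(in[0]))) {
        return 0;
    }
    const std::string punct = ":;,()_.-";
    std::size_t n = 1;
    while (n < in.size()) {
        const unsigned char c = static_cast<unsigned char>(in[n]);
        // [:space:] accepts '\n'; [:blank:] accepts only space and tab.
        const bool ws = use_space_class ? std::isspace(c) != 0
                                        : std::isblank(c) != 0;
        const bool ok = ws || std::isalnum(c) != 0 ||
                        punct.find(in[n]) != std::string::npos;
        if (!ok) break;
        ++n;
    }
    return n;
}
```

On the input "BEGIN IONS\nTITLE=x", the [:space:] variant matches 16 characters, swallowing the newline and the TITLE keyword on the next line, while the [:blank:] variant stops after the 10 characters of "BEGIN IONS".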

I took a quick glance at the description of the format you're trying to parse, but it's not a very precise description. It does appear that parameter lines have the format:

^[[:alpha:]]+=.*$

Perhaps the :alpha: should be :alnum: or even something more permissive; as I said, the description wasn't very precise. What was clear is that:

  • The keyword is case-insensitive, so both TITLE and title will work identically, and
  • The = sign is obligatory and may not have a space on either side of it. (So your TITLE= line is not correct, but maybe it doesn't matter.)

In order not to interfere with parsing of the data, you might want to make the above a single "token" whose value is the part after the = and whose type corresponds to the (case-normalized) keyword. Of course, each parameter type may require its own idiosyncratic value parser, which could only be achieved in flex by use of start conditions. In any event, you should think about the consequences of stray characters in the TITLE which are not part of the STRING pattern, and how you propose to deal with the resulting lexical error.
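
As a rough sketch of that single-token idea, the C++ function below splits one parameter line of the form ^[[:alpha:]]+=.*$ into an upper-cased keyword and the raw value. The Parameter struct and split_parameter are hypothetical names, and the rule that the keyword consists only of letters is taken from the pattern above, which may be too strict for the real format.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type: `ok` is false when the line is not a
// parameter line (for example, an ion line such as "123.79550   20").
struct Parameter {
    bool ok;
    std::string keyword;  // case-normalized (upper case)
    std::string value;    // everything after the first '=', unmodified
};

Parameter split_parameter(const std::string& line) {
    const std::size_t eq = line.find('=');
    if (eq == std::string::npos || eq == 0) {
        return {false, "", ""};
    }
    std::string keyword = line.substr(0, eq);
    for (char& ch : keyword) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (!std::isalpha(c)) {
            return {false, "", ""};   // keyword must be letters only
        }
        ch = static_cast<char>(std::toupper(c));
    }
    return {true, keyword, line.substr(eq + 1)};
}
```

For example, split_parameter("title=Cmpd 1, +MSn(417.2108), 10.0 min") reports the keyword TITLE with the whole remainder as its value, so commas, digits and plus signs in the title never reach the rules meant for ion data.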


Your code does not make it clear how you communicate text values from your lexer to your parser. You need to be aware that the value of yytext is only safe to use inside the lexer action for the token it corresponds to. The next call to the lexer will invalidate it, and bison parsers almost always have a lookahead token, so the lexer will have been called again before the token is processed. Consequently, you must copy yytext in order to pass it to the parser, and the parser needs to take ownership of the copy so that you don't end up leaking memory.
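
The hazard can be reproduced without flex. In the sketch below, FakeLexer is a hypothetical stand-in whose text member behaves like yytext in one respect only: it points into a buffer that the next scan overwrites. A real flex scanner may do more than overwrite (it can also move or resize its buffer), so treat this as an illustration of the principle, not of flex internals.

```cpp
#include <cstring>
#include <string>

// Stand-in for a scanner: `text` points into a buffer that every call
// to next() overwrites, which is the property of yytext that matters here.
struct FakeLexer {
    char buffer[64] = {};
    const char* text = buffer;

    void next(const char* lexeme) {
        std::strncpy(buffer, lexeme, sizeof buffer - 1);
        buffer[sizeof buffer - 1] = '\0';
    }
};

struct Demo {
    std::string copied;    // value copied inside the "action"
    std::string borrowed;  // what a stored raw pointer shows later
};

Demo lookahead_demo() {
    FakeLexer lex;
    lex.next("Cmpd");
    const char* raw = lex.text;    // like handing yytext itself to the parser
    std::string owned = lex.text;  // like copying yytext in the action
    lex.next("next-token");        // parser asks for its lookahead token
    return {owned, raw};           // raw now shows the lookahead's text
}
```

After lookahead_demo runs, copied still holds "Cmpd" while borrowed holds "next-token". Storing the copy in a std::string also settles the ownership question, since the string frees its memory when the semantic value is destroyed; bison's C++ skeleton with `%define api.value.type variant` is one setup that allows std::string semantic values.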
