flex and bison: parse string without quotes
I'm working on an MGF file parser (syntax: http://www.matrixscience.com/help/data_file_help.html ) using flex + bison with C++.
I've implemented the lexer (lex) and the parser (yacc), but I have a problem I can't solve when I try to parse strings.
Important: there is no ' or " around the string.
Here is an example of input:
CHARGE=1+, 2+ and 3+
#some comments
BEGIN IONS
TITLE= Cmpd 1, +MSn(417.2108), 10.0 min //line 20
PEPMASS=417.21083 35173
CHARGE=3+
123.79550 20
285.16455 56
302.14335 146 1+
[other datas ...]
END IONS
BEGIN IONS
[an other one ... ]
Here is the (minimal) lexer; MGF_TOKEN_DEBUG is just a macro that prints a line:
#define MGF_TOKEN_DEBUG(val) std::cout<<"token: "<<val<<std::endl
\n {
MGF_TOKEN_DEBUG("T_EOL");
return token::T_EOL;
}
^[#;!/][^\n]* {
MGF_TOKEN_DEBUG("T_COMMENT");
return token::T_COMMENT;
}
[[:space:]] {}
/** values **/
[0-9]+ {
MGF_TOKEN_DEBUG("V_INTEGER"<<" (="<<yytext<<")");
return token::V_INTEGER;
}
[0-9]+"."[0-9]* {
MGF_TOKEN_DEBUG("V_DOUBLE"<<" (="<<yytext<<")");
return token::V_DOUBLE;
}
[0-9]+("."[0-9]+)?[eE][+-][0-9]+ {
MGF_TOKEN_DEBUG("V_DOUBLE"<<" (="<<yytext<<")");
return token::V_DOUBLE;
}
"+" {
MGF_TOKEN_DEBUG("T_PLUS");
return token::T_PLUS;
}
"=" {
MGF_TOKEN_DEBUG("T_EQUALS");
return token::T_EQUALS;
}
"," {
MGF_TOKEN_DEBUG("T_COMA");
return token::T_COMA;
}
"and" {
MGF_TOKEN_DEBUG("T_AND");
return token::T_AND;
}
/*** keywords */
^"CHARGE" {
MGF_TOKEN_DEBUG("K_CHARGE");
return token::K_CHARGE;
}
^"TITLE" {
MGF_TOKEN_DEBUG("K_TITLE");
return token::K_TITLE;
}
[ others keywords ...]
/**** string : problem here **/
[A-Za-z]([:;,()A-Za-z0-9_.-]|[[:space]])* {
MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")");
return token::V_STRING;
}
And the (minimized) parser:
start : headerparams blocks T_END;
headerparams : /* empty */| headerparams headerparam;
headerparam : K_CHARGE T_EQUALS charge_list T_EOL | [others ...];
blocks : /* empty */ | blocks block;
block : T_BEGIN_IONS T_EOL blockparams ions T_END_IONS T_EOL| T_BEGIN_IONS T_EOL blockparams T_END_IONS T_EOL;
blockparam : K_CHARGE T_EQUALS charge T_EOL | K_TITLE T_EQUALS V_STRING T_EOL | [others...];
ion : number number T_EOL| number number charge T_EOL;
ions : ions ion| ion;
number : V_INTEGER | V_DOUBLE;
charge : V_INTEGER T_PLUS | V_INTEGER T_MINUS;
charge_list : charge| charge_list T_COMA charge | charge_list T_AND charge;
My problem is that I get the following tokens:
[...]
[line 20]
token: K_TITLE
token: T_EQUALS
token: v_STRING (= Cmpd)
token: V_INTEGER (= 1)
Error line 20: syntax error, unexpected integer, expecting end of line
I would like to have:
[...]
[line 20]
token: K_TITLE
token: T_EQUALS
token: v_STRING (Cmpd 1, +MSn (417.2108), 10.0 min)
token: T_EOL
If someone can help me ...
Edit #1: I've "solved" the problem by concatenating tokens:
lex:
[A-Za-z][^\n[:space:]+-=,]* {
MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")");
return token::V_STRING;
}
yacc:
string_st : V_STRING
| string_st V_STRING
| string_st number
| string_st T_COMA
| string_st T_PLUS
| string_st T_MINUS
;
blockparam : K_CHARGE T_EQUALS charge T_EOL | K_TITLE T_EQUALS string_st T_EOL | [others...];
If your string always starts with the text TITLE and ends with \n (the newline character), I would suggest using start conditions:
%x IN_TITLE
"TITLE" { /* return V_STRING of TITLE in c++ code */ BEGIN(IN_TITLE); }
<IN_TITLE>= { /* return T_EQUALS in c++ code */; }
<IN_TITLE>"\n" { BEGIN(INITIAL); }
<IN_TITLE>.* { MGF_TOKEN_DEBUG("V_STRING"<<" (="<<yytext<<")");return token::V_STRING; }
%x IN_TITLE defines the IN_TITLE state, and the pattern text TITLE makes it start. Once it has started, \n makes it go back to the initial state (INITIAL is predefined), and all other characters are simply consumed into V_STRING without any particular action.
Your basic problem is a simple typo:
[A-Za-z]([:;,()A-Za-z0-9_.-]|[[:space]])*
should be:
[A-Za-z]([:;,()A-Za-z0-9_.-]|[[:space:]])*
                                     ^
You don't actually need the | operator. The following is perfectly legal (but probably not what you want either; see below):
[A-Za-z][[:space:]:;,()A-Za-z0-9_.-]*
Once you fix that, you'll find that you have another problem: your keywords (TITLE, for example) will be lexed as STRING because the STRING pattern is longer. (In fact, since [:space:] includes \n, the STRING pattern will probably extend to the end of the input. You probably wanted [:blank:].)
I took a quick glance at the description of the format you're trying to parse, but it's not a very precise description. But it appears that parameter lines have the format:
^[[:alpha:]]+=.*$
Perhaps the :alpha: should be :alnum: or even something more permissive; as I said, the description wasn't very precise. What was clear is that:
TITLE and title will work identically, and
the = sign is obligatory and may not have a space on either side of it. (So your TITLE= line is not correct, but maybe it doesn't matter).
In order to not interfere with parsing of the data, you might want to make the above a single "token" whose value is the part after the = and whose type corresponds to the (case-normalized) keyword. Of course, each parameter-type may require an idiosyncratic value parser, which could only be achieved in flex by use of start conditions. In any event, you should think about the consequences of stray characters in the TITLE which are not part of the STRING pattern, and how you propose to deal with the resulting lexical error.
Your code does not make it clear how you communicate text values from your lexer to your parser. You need to be aware that the value of yytext is only safe inside of the lexer action for the token it corresponds to. The next call to the lexer will invalidate it, and bison parsers almost always have a lookahead token, so the lexer will have been called again before the token is processed. Consequently, you must copy yytext in order to pass it to the parser, and the parser needs to take ownership of the copy so that you don't end up leaking memory.