
Whitespace skipper when using Boost.Spirit Qi and Lex

Let's consider the following code:

#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/qi.hpp>
#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

namespace lex = boost::spirit::lex;
namespace qi = boost::spirit::qi;

template<typename Lexer>
class expression_lexer
    : public lex::lexer<Lexer>
{
public:
    typedef lex::token_def<> operator_token_type;
    typedef lex::token_def<> value_token_type;
    typedef lex::token_def<> variable_token_type;
    typedef lex::token_def<lex::omit> parenthesis_token_type;
    typedef std::pair<parenthesis_token_type, parenthesis_token_type> parenthesis_token_pair_type;
    typedef lex::token_def<lex::omit> whitespace_token_type;

    expression_lexer()
        : operator_add('+'),
          operator_sub('-'),
          operator_mul("[x*]"),
          operator_div("[:/]"),
          value("\\d+(\\.\\d+)?"),
          variable("%(\\w+)"),
          parenthesis({
            std::make_pair(parenthesis_token_type('('), parenthesis_token_type(')')),
            std::make_pair(parenthesis_token_type('['), parenthesis_token_type(']'))
          }),
          whitespace("[ \\t]+")
    {
        this->self
            = operator_add
            | operator_sub
            | operator_mul
            | operator_div
            | value
            | variable
            ;

        std::for_each(parenthesis.cbegin(), parenthesis.cend(),
            [&](parenthesis_token_pair_type const& token_pair)
            {
                this->self += token_pair.first | token_pair.second;
            }
        );

        this->self("WS") = whitespace;
    }

    operator_token_type operator_add;
    operator_token_type operator_sub;
    operator_token_type operator_mul;
    operator_token_type operator_div;

    value_token_type value;
    variable_token_type variable;

    std::vector<parenthesis_token_pair_type> parenthesis;

    whitespace_token_type whitespace;
};

template<typename Iterator, typename Skipper>
class expression_grammar
    : public qi::grammar<Iterator, Skipper>
{
public:
    template<typename Tokens>
    explicit expression_grammar(Tokens const& tokens)
        : expression_grammar::base_type(start)
    {
        start                     %= expression >> qi::eoi;

        expression                %= sum_operand >> -(sum_operator >> expression);
        sum_operator              %= tokens.operator_add | tokens.operator_sub;
        sum_operand               %= fac_operand >> -(fac_operator >> sum_operand);
        fac_operator              %= tokens.operator_mul | tokens.operator_div;

        if(!tokens.parenthesis.empty())
            fac_operand           %= parenthesised | terminal;
        else
            fac_operand           %= terminal;

        terminal                  %= tokens.value | tokens.variable;

        if(!tokens.parenthesis.empty())
        {
            parenthesised         %= tokens.parenthesis.front().first >> expression >> tokens.parenthesis.front().second;
            std::for_each(tokens.parenthesis.cbegin() + 1, tokens.parenthesis.cend(),
                [&](typename Tokens::parenthesis_token_pair_type const& token_pair)
                {
                    parenthesised %= parenthesised.copy() | (token_pair.first >> expression >> token_pair.second);
                }
            );
        }
    }

private:
    qi::rule<Iterator, Skipper> start;
    qi::rule<Iterator, Skipper> expression;
    qi::rule<Iterator, Skipper> sum_operand;
    qi::rule<Iterator, Skipper> sum_operator;
    qi::rule<Iterator, Skipper> fac_operand;
    qi::rule<Iterator, Skipper> fac_operator;
    qi::rule<Iterator, Skipper> terminal;
    qi::rule<Iterator, Skipper> parenthesised;
};


int main()
{
    typedef lex::lexertl::token<std::string::const_iterator> token_type;
    typedef expression_lexer<lex::lexertl::lexer<token_type>> expression_lexer_type;
    typedef expression_lexer_type::iterator_type expression_lexer_iterator_type;
    typedef qi::in_state_skipper<expression_lexer_type::lexer_def> skipper_type;
    typedef expression_grammar<expression_lexer_iterator_type, skipper_type> expression_grammar_type;

    expression_lexer_type lexer;
    expression_grammar_type grammar(lexer);

    std::string line;
    // Test the read itself, so that no phantom empty line is parsed at end of input.
    while(std::getline(std::cin, line))
    {
        std::string::const_iterator first = line.begin();
        std::string::const_iterator const last = line.end();

        bool const result = lex::tokenize_and_phrase_parse(first, last, lexer, grammar, qi::in_state("WS")[lexer.self]);
        if(!result)
            std::cout << "Parsing failed! Remainder: >" << std::string(first, last) << "<" << std::endl;
        else
        {
            if(first != last)
                std::cout << "Parsing succeeded! Remainder: >" << std::string(first, last) << "<" << std::endl;
            else
                std::cout << "Parsing succeeded!" << std::endl;
        }
    }
}

It is a simple parser for arithmetic expressions with values and variables. It is built using expression_lexer to extract tokens and expression_grammar to parse them.

Using a lexer for such a small case might seem like overkill, and it probably is. But that is the cost of a simplified example. Also note that a lexer makes it easy to define tokens with regular expressions, which in turn makes it easy to define them from external code (and from user-provided configuration in particular). With the example provided, it would be no trouble at all to read the token definitions from an external config file and, for example, allow the user to change variables from %name to $name.
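To illustrate the configuration idea in isolation, here is a minimal sketch that deliberately uses std::regex instead of Spirit.Lex, so it only shows the principle. The names token_rule, make_rules and token_names are hypothetical and do not come from the code above. Note that this toy picks the first rule that matches, which is an assumption of the sketch and not necessarily how a generated lexer resolves ambiguities.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical configuration entry: a token name plus the regular expression defining it.
struct token_rule
{
    std::string name;
    std::regex pattern;
};

// Builds the rule table. Only the variable pattern comes from "configuration";
// the other two patterns mirror the ones used in the question.
std::vector<token_rule> make_rules(std::string const& variable_pattern)
{
    assert(!variable_pattern.empty());
    std::vector<token_rule> rules;
    rules.push_back(token_rule{"value", std::regex("\\d+(\\.\\d+)?")});
    rules.push_back(token_rule{"variable", std::regex(variable_pattern)});
    rules.push_back(token_rule{"operator", std::regex("[-+x*:/]")});
    return rules;
}

// Returns the names of the tokens found in input, in order. The first rule that
// matches at the current position wins; an unmatched character ends the scan
// with an "error" entry.
std::vector<std::string> token_names(std::string const& input, std::vector<token_rule> const& rules)
{
    std::vector<std::string> names;
    std::string::const_iterator pos = input.begin();
    while(pos != input.end())
    {
        bool matched = false;
        for(token_rule const& rule : rules)
        {
            std::smatch m;
            // match_continuous anchors the match at pos.
            if(std::regex_search(pos, input.end(), m, rule.pattern, std::regex_constants::match_continuous)
                && m.length(0) > 0)
            {
                names.push_back(rule.name);
                pos += m.length(0);
                matched = true;
                break;
            }
        }
        if(!matched)
        {
            names.push_back("error");
            break;
        }
    }
    return names;
}
```

Calling make_rules("%\\w+") accepts %name, while make_rules("\\$\\w+") accepts $name; nothing else in the tokenizer has to change.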

The code seems to work fine (checked on Visual Studio 2013 with Boost 1.61), except that if I provide a string like 5++5, it properly fails but reports just 5 as the remainder rather than +5, which means the offending + was "unrecoverably" consumed. Apparently a token that was produced but did not match the grammar is in no way returned to the original input. But that is not what I'm asking about; it is just a side note I realized when checking the code.
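A plausible explanation, sketched here with a toy model and not with Spirit itself, is that the parser backtracks over a buffer of tokens while the character iterator only records how far the lexer has read. The token_stream class and the one-character tokens are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy token stream: pulls one-character tokens from the input on demand and
// buffers them, so a parser can backtrack over tokens but never over characters.
class token_stream
{
public:
    token_stream(std::string::const_iterator& first, std::string::const_iterator last)
        : first_(first), last_(last)
    {
        assert(first_ <= last_); // precondition: a valid range
    }

    // Returns the token at index i, lexing more input if needed; '\0' means end of input.
    char at(std::size_t i)
    {
        while(buffer_.size() <= i && first_ != last_)
            buffer_.push_back(*first_++); // advances the caller's iterator
        return i < buffer_.size() ? buffer_[i] : '\0';
    }

private:
    std::string::const_iterator& first_;
    std::string::const_iterator last_;
    std::vector<char> buffer_;
};

inline bool is_digit(char c) { return c >= '0' && c <= '9'; }

// expression := digit ('+' expression)?
// Returns the token index just past the match, or npos if nothing matched.
std::size_t parse_expression(token_stream& ts, std::size_t i)
{
    if(!is_digit(ts.at(i)))
        return std::string::npos;
    ++i;
    if(ts.at(i) == '+')
    {
        std::size_t const j = parse_expression(ts, i + 1);
        if(j != std::string::npos)
            return j;
        // Otherwise backtrack: the optional part did not match, stay at i.
    }
    return i;
}

// Parses a whole line. On return, first shows how far the lexer got,
// not how far the grammar matched.
bool parse_line(std::string::const_iterator& first, std::string::const_iterator last)
{
    token_stream ts(first, last);
    std::size_t const end = parse_expression(ts, 0);
    return end != std::string::npos && ts.at(end) == '\0';
}
```

For the input 5++5 this model rejects the line and leaves first pointing at the final 5, because the second + had to be lexed before the grammar could reject it.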

Now, the problem is with whitespace skipping. I very much don't like how it is done. I did it this way because it seems to be the approach used by many examples, including answers to questions here on StackOverflow.

The worst thing seems to be that (nowhere documented?) qi::in_state_skipper. Also, it seems that I have to add the whitespace token like that (with a name) rather than like all the other ones, as using lexer.whitespace instead of "WS" doesn't seem to work.

And finally, having to "clutter" the grammar with the Skipper argument doesn't seem nice. Shouldn't I be free of it? After all, I want to build the grammar on tokens rather than on direct input, and I want whitespace to be excluded from the token stream; it is not needed there anymore!

What other options do I have for skipping whitespace? What are the advantages of doing it the way it is done now?

For some strange reason, only now did I find a different question, "Boost.Spirit SQL grammar/lexer failure", where another solution to whitespace skipping is provided. A better one!

So below is the example code reworked along the suggestions there:

#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/qi.hpp>
#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

namespace lex = boost::spirit::lex;
namespace qi = boost::spirit::qi;

template<typename Lexer>
class expression_lexer
    : public lex::lexer<Lexer>
{
public:
    typedef lex::token_def<> operator_token_type;
    typedef lex::token_def<> value_token_type;
    typedef lex::token_def<> variable_token_type;
    typedef lex::token_def<lex::omit> parenthesis_token_type;
    typedef std::pair<parenthesis_token_type, parenthesis_token_type> parenthesis_token_pair_type;
    typedef lex::token_def<lex::omit> whitespace_token_type;

    expression_lexer()
        : operator_add('+'),
          operator_sub('-'),
          operator_mul("[x*]"),
          operator_div("[:/]"),
          value("\\d+(\\.\\d+)?"),
          variable("%(\\w+)"),
          parenthesis({
            std::make_pair(parenthesis_token_type('('), parenthesis_token_type(')')),
            std::make_pair(parenthesis_token_type('['), parenthesis_token_type(']'))
          }),
          whitespace("[ \\t]+")
    {
        this->self
            += operator_add
            | operator_sub
            | operator_mul
            | operator_div
            | value
            | variable
            | whitespace [lex::_pass = lex::pass_flags::pass_ignore]
            ;

        std::for_each(parenthesis.cbegin(), parenthesis.cend(),
            [&](parenthesis_token_pair_type const& token_pair)
            {
                this->self += token_pair.first | token_pair.second;
            }
        );
    }

    operator_token_type operator_add;
    operator_token_type operator_sub;
    operator_token_type operator_mul;
    operator_token_type operator_div;

    value_token_type value;
    variable_token_type variable;

    std::vector<parenthesis_token_pair_type> parenthesis;

    whitespace_token_type whitespace;
};

template<typename Iterator>
class expression_grammar
    : public qi::grammar<Iterator>
{
public:
    template<typename Tokens>
    explicit expression_grammar(Tokens const& tokens)
        : expression_grammar::base_type(start)
    {
        start                     %= expression >> qi::eoi;

        expression                %= sum_operand >> -(sum_operator >> expression);
        sum_operator              %= tokens.operator_add | tokens.operator_sub;
        sum_operand               %= fac_operand >> -(fac_operator >> sum_operand);
        fac_operator              %= tokens.operator_mul | tokens.operator_div;

        if(!tokens.parenthesis.empty())
            fac_operand           %= parenthesised | terminal;
        else
            fac_operand           %= terminal;

        terminal                  %= tokens.value | tokens.variable;

        if(!tokens.parenthesis.empty())
        {
            parenthesised         %= tokens.parenthesis.front().first >> expression >> tokens.parenthesis.front().second;
            std::for_each(tokens.parenthesis.cbegin() + 1, tokens.parenthesis.cend(),
                [&](typename Tokens::parenthesis_token_pair_type const& token_pair)
                {
                    parenthesised %= parenthesised.copy() | (token_pair.first >> expression >> token_pair.second);
                }
            );
        }
    }

private:
    qi::rule<Iterator> start;
    qi::rule<Iterator> expression;
    qi::rule<Iterator> sum_operand;
    qi::rule<Iterator> sum_operator;
    qi::rule<Iterator> fac_operand;
    qi::rule<Iterator> fac_operator;
    qi::rule<Iterator> terminal;
    qi::rule<Iterator> parenthesised;
};


int main()
{
    typedef lex::lexertl::token<std::string::const_iterator> token_type;
    typedef expression_lexer<lex::lexertl::actor_lexer<token_type>> expression_lexer_type;
    typedef expression_lexer_type::iterator_type expression_lexer_iterator_type;
    typedef expression_grammar<expression_lexer_iterator_type> expression_grammar_type;

    expression_lexer_type lexer;
    expression_grammar_type grammar(lexer);

    std::string line;
    // Test the read itself, so that no phantom empty line is parsed at end of input.
    while(std::getline(std::cin, line))
    {
        std::string::const_iterator first = line.begin();
        std::string::const_iterator const last = line.end();

        bool const result = lex::tokenize_and_parse(first, last, lexer, grammar);
        if(!result)
            std::cout << "Parsing failed! Remainder: >" << std::string(first, last) << "<" << std::endl;
        else
        {
            if(first != last)
                std::cout << "Parsing succeeded! Remainder: >" << std::string(first, last) << "<" << std::endl;
            else
                std::cout << "Parsing succeeded!" << std::endl;
        }
    }
}

The differences are as follows:

  1. The whitespace token is added to the lexer's self like all the other tokens.
  2. However, an action is associated with it. The action makes the lexer ignore the token, which is exactly what we want.
  3. My expression_grammar no longer takes the Skipper template argument, so it is also removed from the rules.
  4. lex::lexertl::actor_lexer is used instead of lex::lexertl::lexer, since there is now an action associated with a token.
  5. I'm calling tokenize_and_parse instead of tokenize_and_phrase_parse, as I don't need to pass a skipper anymore.
  6. I also changed the first assignment to this->self in the lexer from = to +=, as it seems more flexible (resistant to order changes). It doesn't affect the solution here, though.
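The idea behind points 1 and 2 can be sketched without Spirit: whitespace is matched like any other token, and an action attached to the match decides that it is never handed to the parser. The pass_flag enum, on_match and tokenize below are hypothetical names for this toy, not the library's API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy analogue of the verdict a lexer action can give for a matched token.
enum class pass_flag { normal, ignore };

inline bool is_blank(char c) { return c == ' ' || c == '\t'; }
inline bool is_digit(char c) { return c >= '0' && c <= '9'; }

// Toy "action" run for every match: whitespace lexemes are ignored.
inline pass_flag on_match(std::string const& lexeme)
{
    return is_blank(lexeme[0]) ? pass_flag::ignore : pass_flag::normal;
}

// Splits input into whitespace runs, digit runs and single characters,
// and emits only the lexemes whose action returned pass_flag::normal.
std::vector<std::string> tokenize(std::string const& input)
{
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while(i < input.size())
    {
        std::size_t j = i + 1;
        if(is_blank(input[i]))
            while(j < input.size() && is_blank(input[j])) ++j;
        else if(is_digit(input[i]))
            while(j < input.size() && is_digit(input[j])) ++j;
        assert(j > i); // every iteration consumes at least one character

        std::string const lexeme = input.substr(i, j - i);
        if(on_match(lexeme) == pass_flag::normal)
            tokens.push_back(lexeme); // whitespace is consumed but never emitted
        i = j;
    }
    return tokens;
}
```

With this arrangement the consumer of tokenize never sees whitespace, which is why no skipper has to appear in the grammar.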

I'm good with this. It suits my needs (or, better to say, my taste) perfectly. However, I wonder whether there are any other consequences of such a change. Is either approach preferred in some situations? That I don't know.
