Parsing chemical formula with mixtures of elements

I would like to use boost::spirit to extract the stoichiometry of compounds made of several elements from a brute formula. Within a given compound, my parser should be able to distinguish three kinds of chemical element patterns:

  • natural element made of a mixture of isotopes in natural abundance
  • pure isotope
  • mixture of isotopes in non-natural abundance

Those patterns are then used to parse compounds such as the following:

  • "C" --> natural carbon made of C[12] and C[13] in natural abundance
  • "CH4" --> methane made of natural carbon and hydrogen
  • "C2H{H[1](0.8)H[2](0.2)}6" --> ethane made of natural C and non-natural H made of 80% hydrogen and 20% deuterium
  • "U[235]" --> pure uranium 235

Obviously, the chemical element patterns can appear in any order (e.g. CH[1]4 and H[1]4C ...) and with any frequency.

I wrote a parser that comes quite close to doing the job, but I still face one problem.

Here is my code:

template <typename Iterator>
struct ChemicalFormulaParser : qi::grammar<Iterator,isotopesMixture(),qi::locals<isotopesMixture,double>>
{
    ChemicalFormulaParser(): ChemicalFormulaParser::base_type(_start)
    {

        namespace phx = boost::phoenix;

        // Semantic action for handling the case of pure isotope    
        phx::function<PureIsotopeBuilder> const build_pure_isotope = PureIsotopeBuilder();
        // Semantic action for handling the case of pure isotope mixture   
        phx::function<IsotopesMixtureBuilder> const build_isotopes_mixture = IsotopesMixtureBuilder();
        // Semantic action for handling the case of natural element   
        phx::function<NaturalElementBuilder> const build_natural_element = NaturalElementBuilder();

        phx::function<UpdateElement> const update_element = UpdateElement();

        // XML database that store all the isotopes of the periodical table
        ChemicalDatabaseManager<Isotope>* imgr=ChemicalDatabaseManager<Isotope>::Instance();
        const auto& isotopeDatabase=imgr->getDatabase();
        // Loop over the database to the spirit symbols for the isotopes names (e.g. H[1],C[14]) and the elements (e.g. H,C)
        for (const auto& isotope : isotopeDatabase) {
            _isotopeNames.add(isotope.second.getName(),isotope.second.getName());
            _elementSymbols.add(isotope.second.getProperty<std::string>("symbol"),isotope.second.getProperty<std::string>("symbol"));
        }

        _mixtureToken = "{" >> +(_isotopeNames >> "(" >> qi::double_ >> ")") >> "}";
        _isotopesMixtureToken = (_elementSymbols[qi::_a=qi::_1] >> _mixtureToken[qi::_b=qi::_1])[qi::_pass=build_isotopes_mixture(qi::_val,qi::_a,qi::_b)];

        _pureIsotopeToken = (_isotopeNames[qi::_a=qi::_1])[qi::_pass=build_pure_isotope(qi::_val,qi::_a)];
        _naturalElementToken = (_elementSymbols[qi::_a=qi::_1])[qi::_pass=build_natural_element(qi::_val,qi::_a)];

        _start = +( ( (_isotopesMixtureToken | _pureIsotopeToken | _naturalElementToken)[qi::_a=qi::_1] >>
                      (qi::double_|qi::attr(1.0))[qi::_b=qi::_1])[qi::_pass=update_element(qi::_val,qi::_a,qi::_b)] );

    }

    //! Defines the rule for matching a prefix
    qi::symbols<char,std::string> _isotopeNames;
    qi::symbols<char,std::string> _elementSymbols;

    qi::rule<Iterator,isotopesMixture()> _mixtureToken;
    qi::rule<Iterator,isotopesMixture(),qi::locals<std::string,isotopesMixture>> _isotopesMixtureToken;

    qi::rule<Iterator,isotopesMixture(),qi::locals<std::string>> _pureIsotopeToken;
    qi::rule<Iterator,isotopesMixture(),qi::locals<std::string>> _naturalElementToken;

    qi::rule<Iterator,isotopesMixture(),qi::locals<isotopesMixture,double>> _start;
};

Basically, each separate element pattern can be parsed properly with its respective semantic action, which produces as output a map between the isotopes that build the compound and their corresponding stoichiometry. The problem starts when parsing the following compound:

CH{H[1](0.9)H[2](0.4)}

In this case the semantic action build_isotopes_mixture returns false because 0.9+0.4 makes no sense as a sum of ratios. Hence I would have expected, and wanted, my parser to fail for this compound. However, because the _start rule uses the alternative operator for the three kinds of chemical element pattern, the parser manages to parse it by 1) throwing away the {H[1](0.9)H[2](0.4)} part, 2) keeping the preceding H, and 3) parsing it using _naturalElementToken. Is my grammar not clear enough to be expressed as a parser? How can I use the alternative operator in such a way that, when a match has been found but its semantic action returned false, the parser stops?

How to use the alternative operator in such a way that, when an occurrence has been found but gave a false when running the semantic action, the parser stops?

In general, you achieve this by adding an expectation point to prevent backtracking.

In this case you are actually "conflating" several tasks:

  1. matching input
  2. interpreting matched input
  3. validating matched input

Spirit excels at matching input and has great facilities when it comes to interpreting (mostly in the sense of AST creation). However, things get "nasty" when validating on the fly.

A piece of advice I often repeat is to consider separating the concerns whenever possible. I'd consider

  1. building a direct AST representation of the input first,
  2. transforming/normalizing/expanding/canonicalizing it to a more convenient or meaningful domain representation,
  3. doing final validations on the result

This gives you the most expressive code while keeping it highly maintainable.
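As a rough illustration of steps 1 and 3, the sketch below assumes the parser does nothing but fill a plain AST, and that the weight check runs as a separate pass afterwards. All names here (IsotopeFraction, ElementTerm, Formula, first_invalid_term) are hypothetical and not taken from the question's code; the 1e-5 tolerance is an arbitrary choice.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical AST: one entry per element pattern as written in the formula.
struct IsotopeFraction {
    std::string isotope;  // e.g. "H[2]"
    double      fraction; // e.g. 0.2
};

struct ElementTerm {
    std::string                  name;    // "C", "U[235]", or the "H" in front of "{...}"
    std::vector<IsotopeFraction> mixture; // empty unless a "{...}" block was given
    double                       count;   // stoichiometric coefficient, 1.0 when omitted
};

using Formula = std::vector<ElementTerm>;

// Validation pass, run after parsing: every explicit mixture must sum to 1.
// Returns the index of the first offending term, or -1 if all are consistent.
inline int first_invalid_term(Formula const& formula, double tolerance = 1e-5) {
    for (std::size_t i = 0; i < formula.size(); ++i) {
        auto const& term = formula[i];
        if (term.mixture.empty())
            continue; // natural element or pure isotope: nothing to check

        double total = 0.0;
        for (auto const& part : term.mixture)
            total += part.fraction;

        if (std::abs(total - 1.0) >= tolerance)
            return static_cast<int>(i);
    }
    return -1;
}
```

Because the check no longer lives in a semantic action, the grammar cannot "work around" a failed validation by backtracking into another alternative, and the caller can report exactly which term is inconsistent.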

Because I don't understand the problem domain well enough, and the code sample is not nearly complete enough to infer it, I will not try to give a full sample of what I have in mind. Instead I'll try my best at sketching the expectation point approach I mentioned at the outset.

Mock Up Sample To Compile

This took the most time. (Consider doing the leg work for the people who are going to help you.)

Live On Coliru

#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <iostream>
#include <list>
#include <map>
#include <stdexcept>
#include <string>

namespace qi = boost::spirit::qi;

struct DummyBuilder {
    using result_type = bool;

    template <typename... Ts>
    bool operator()(Ts&&...) const { return true; }
};

struct PureIsotopeBuilder     : DummyBuilder {  };
struct IsotopesMixtureBuilder : DummyBuilder {  };
struct NaturalElementBuilder  : DummyBuilder {  };
struct UpdateElement          : DummyBuilder {  };

struct Isotope {
    std::string getName() const { return _name; }

    Isotope(std::string const& name = "unnamed", std::string const& symbol = "?") : _name(name), _symbol(symbol) { }

    template <typename T> std::string getProperty(std::string const& name) const {
        if (name == "symbol")
            return _symbol;
        throw std::domain_error("no such property (" + name + ")");
    }

  private:
    std::string _name, _symbol;
};

using MixComponent    = std::pair<Isotope, double>;
using isotopesMixture = std::list<MixComponent>;

template <typename Isotope>
struct ChemicalDatabaseManager {
    static ChemicalDatabaseManager* Instance() {
        static ChemicalDatabaseManager s_instance;
        return &s_instance;
    }

    auto& getDatabase() { return _db; }
  private:
    std::map<int, Isotope> _db {
        { 1, { "H[1]",   "H" } },
        { 2, { "H[2]",   "H" } },
        { 3, { "Carbon", "C" } },
        { 4, { "U[235]", "U" } },
    };
};

template <typename Iterator>
struct ChemicalFormulaParser : qi::grammar<Iterator, isotopesMixture(), qi::locals<isotopesMixture, double> >
{
    ChemicalFormulaParser(): ChemicalFormulaParser::base_type(_start)
    {
        using namespace qi;
        namespace phx = boost::phoenix;

        phx::function<PureIsotopeBuilder>     build_pure_isotope;     // Semantic action for handling the case of pure isotope
        phx::function<IsotopesMixtureBuilder> build_isotopes_mixture; // Semantic action for handling the case of pure isotope mixture
        phx::function<NaturalElementBuilder>  build_natural_element;  // Semantic action for handling the case of natural element
        phx::function<UpdateElement>          update_element;

        // XML database that store all the isotopes of the periodical table
        ChemicalDatabaseManager<Isotope>* imgr = ChemicalDatabaseManager<Isotope>::Instance();
        const auto& isotopeDatabase=imgr->getDatabase();

        // Loop over the database to the spirit symbols for the isotopes names (e.g. H[1],C[14]) and the elements (e.g. H,C)
        for (const auto& isotope : isotopeDatabase) {
            _isotopeNames.add(isotope.second.getName(),isotope.second.getName());
            _elementSymbols.add(isotope.second.template getProperty<std::string>("symbol"),isotope.second.template getProperty<std::string>("symbol"));
        }

        _mixtureToken         = "{" >> +(_isotopeNames >> "(" >> double_ >> ")") >> "}";
        _isotopesMixtureToken = (_elementSymbols[_a=_1] >> _mixtureToken[_b=_1])[_pass=build_isotopes_mixture(_val,_a,_b)];

        _pureIsotopeToken     = (_isotopeNames[_a=_1])[_pass=build_pure_isotope(_val,_a)];
        _naturalElementToken  = (_elementSymbols[_a=_1])[_pass=build_natural_element(_val,_a)];

        _start = +( ( (_isotopesMixtureToken | _pureIsotopeToken | _naturalElementToken)[_a=_1] >>
                    (double_|attr(1.0))[_b=_1]) [_pass=update_element(_val,_a,_b)] );
    }

  private:
    //! Defines the rule for matching a prefix
    qi::symbols<char, std::string> _isotopeNames;
    qi::symbols<char, std::string> _elementSymbols;

    qi::rule<Iterator, isotopesMixture()> _mixtureToken;
    qi::rule<Iterator, isotopesMixture(), qi::locals<std::string, isotopesMixture> > _isotopesMixtureToken;

    qi::rule<Iterator, isotopesMixture(), qi::locals<std::string> > _pureIsotopeToken;
    qi::rule<Iterator, isotopesMixture(), qi::locals<std::string> > _naturalElementToken;

    qi::rule<Iterator, isotopesMixture(), qi::locals<isotopesMixture, double> > _start;
};

int main() {
    using It = std::string::const_iterator;
    ChemicalFormulaParser<It> parser;
    for (std::string const input : {
            "C",                        // --> natural carbon made of C[12] and C[13] in natural abundance
            "CH4",                      // --> methane made of natural carbon and hydrogen
            "C2H{H[1](0.8)H[2](0.2)}6", // --> ethane made of natural C and non-natural H made of 80% of hydrogen and 20% of deuterium
            "C2H{H[1](0.9)H[2](0.2)}6", // --> invalid mixture (total is 110%?)
            "U[235]",                   // --> pure uranium 235
        })
    {
        std::cout << " ============= '" << input << "' ===========\n";
        It f = input.begin(), l = input.end();
        isotopesMixture mixture;
        bool ok = qi::parse(f, l, parser, mixture);

        if (ok)
            std::cout << "Parsed successfully\n";
        else
            std::cout << "Parse failure\n";

        if (f != l)
            std::cout << "Remaining input unparsed: '" << std::string(f, l) << "'\n";
    }
}

Which, as given, just prints

 ============= 'C' ===========
Parsed successfully
 ============= 'CH4' ===========
Parsed successfully
 ============= 'C2H{H[1](0.8)H[2](0.2)}6' ===========
Parsed successfully
 ============= 'C2H{H[1](0.9)H[2](0.2)}6' ===========
Parsed successfully
 ============= 'U[235]' ===========
Parsed successfully

General remarks:

  1. no need for the locals, just use the regular placeholders:

     _mixtureToken         = "{" >> +(_isotopeNames >> "(" >> double_ >> ")") >> "}";
     _isotopesMixtureToken = (_elementSymbols >> _mixtureToken) [ _pass=build_isotopes_mixture(_val, _1, _2) ];
     _pureIsotopeToken     = _isotopeNames [ _pass=build_pure_isotope(_val, _1) ];
     _naturalElementToken  = _elementSymbols [ _pass=build_natural_element(_val, _1) ];
     _start = +(
             ( (_isotopesMixtureToken | _pureIsotopeToken | _naturalElementToken) >>
               (double_|attr(1.0)) ) [ _pass=update_element(_val, _1, _2) ]
         );

     // ....
     qi::rule<Iterator, isotopesMixture()> _mixtureToken;
     qi::rule<Iterator, isotopesMixture()> _isotopesMixtureToken;
     qi::rule<Iterator, isotopesMixture()> _pureIsotopeToken;
     qi::rule<Iterator, isotopesMixture()> _naturalElementToken;
     qi::rule<Iterator, isotopesMixture()> _start;

  2. you will want to handle conflicts between names/symbols (possibly just by prioritizing one or the other)

  3. conforming compilers will require the template qualifier (unless I totally mis-guessed your data structure, in which case I don't know what the template argument to ChemicalDatabaseManager was supposed to mean).

    Hint: MSVC is not a standards-conforming compiler

Live On Coliru

Expectation Point Sketch

Assuming that the "weights" need to add up to 100% inside the _mixtureToken rule, we can either make build_isotopes_mixture "not dummy" and add the validation:

struct IsotopesMixtureBuilder {
    bool operator()(isotopesMixture&/* output*/, std::string const&/* elementSymbol*/, isotopesMixture const& mixture) const {
        using namespace boost::adaptors;

        // validate weights total only
        return std::abs(1.0 - boost::accumulate(mixture | map_values, 0.0)) < 0.00001;
    }
};

However, as you note, backtracking will thwart this. Instead you might /assert/ that any complete mixture adds up to 100%:

_mixtureToken         = "{" >> +(_isotopeNames >> "(" >> double_ >> ")") >> "}" > eps(validate_weight_total(_val));

With something like

struct ValidateWeightTotal {
    bool operator()(isotopesMixture const& mixture) const {
        using namespace boost::adaptors;

        bool ok = std::abs(1.0 - boost::accumulate(mixture | map_values, 0.0)) < 0.00001;
        return ok;
        // or perhaps just throw a dedicated exception instead:
        // return ok ? ok : throw InconsistentsWeights {};
    }

    struct InconsistentsWeights : virtual std::runtime_error {
        InconsistentsWeights() : std::runtime_error("InconsistentsWeights") {}
    };
};

Live On Coliru

#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/range/adaptors.hpp>
#include <boost/range/numeric.hpp>
#include <cmath>
#include <iostream>
#include <list>
#include <map>
#include <stdexcept>
#include <string>

namespace qi = boost::spirit::qi;

struct DummyBuilder {
    using result_type = bool;

    template <typename... Ts>
    bool operator()(Ts&&...) const { return true; }
};

struct PureIsotopeBuilder     : DummyBuilder {  };
struct NaturalElementBuilder  : DummyBuilder {  };
struct UpdateElement          : DummyBuilder {  };

struct Isotope {
    std::string getName() const { return _name; }

    Isotope(std::string const& name = "unnamed", std::string const& symbol = "?") : _name(name), _symbol(symbol) { }

    template <typename T> std::string getProperty(std::string const& name) const {
        if (name == "symbol")
            return _symbol;
        throw std::domain_error("no such property (" + name + ")");
    }

  private:
    std::string _name, _symbol;
};

using MixComponent    = std::pair<Isotope, double>;
using isotopesMixture = std::list<MixComponent>;

struct IsotopesMixtureBuilder {
    bool operator()(isotopesMixture&/* output*/, std::string const&/* elementSymbol*/, isotopesMixture const& mixture) const {
        using namespace boost::adaptors;

        // validate weights total only
        return std::abs(1.0 - boost::accumulate(mixture | map_values, 0.0)) < 0.00001;
    }
};

struct ValidateWeightTotal {
    bool operator()(isotopesMixture const& mixture) const {
        using namespace boost::adaptors;

        bool ok = std::abs(1.0 - boost::accumulate(mixture | map_values, 0.0)) < 0.00001;
        return ok;
        // or perhaps just throw a dedicated exception instead:
        // return ok ? ok : throw InconsistentsWeights {};
    }

    struct InconsistentsWeights : virtual std::runtime_error {
        InconsistentsWeights() : std::runtime_error("InconsistentsWeights") {}
    };
};

template <typename Isotope>
struct ChemicalDatabaseManager {
    static ChemicalDatabaseManager* Instance() {
        static ChemicalDatabaseManager s_instance;
        return &s_instance;
    }

    auto& getDatabase() { return _db; }
  private:
    std::map<int, Isotope> _db {
        { 1, { "H[1]",   "H" } },
        { 2, { "H[2]",   "H" } },
        { 3, { "Carbon", "C" } },
        { 4, { "U[235]", "U" } },
    };
};

template <typename Iterator>
struct ChemicalFormulaParser : qi::grammar<Iterator, isotopesMixture()>
{
    ChemicalFormulaParser(): ChemicalFormulaParser::base_type(_start)
    {
        using namespace qi;
        namespace phx = boost::phoenix;

        phx::function<PureIsotopeBuilder>     build_pure_isotope;     // Semantic action for handling the case of pure isotope
        phx::function<IsotopesMixtureBuilder> build_isotopes_mixture; // Semantic action for handling the case of pure isotope mixture
        phx::function<NaturalElementBuilder>  build_natural_element;  // Semantic action for handling the case of natural element
        phx::function<UpdateElement>          update_element;
        phx::function<ValidateWeightTotal>    validate_weight_total;

        // XML database that store all the isotopes of the periodical table
        ChemicalDatabaseManager<Isotope>* imgr = ChemicalDatabaseManager<Isotope>::Instance();
        const auto& isotopeDatabase=imgr->getDatabase();

        // Loop over the database to the spirit symbols for the isotopes names (e.g. H[1],C[14]) and the elements (e.g. H,C)
        for (const auto& isotope : isotopeDatabase) {
            _isotopeNames.add(isotope.second.getName(),isotope.second.getName());
            _elementSymbols.add(isotope.second.template getProperty<std::string>("symbol"), isotope.second.template getProperty<std::string>("symbol"));
        }

        _mixtureToken         = "{" >> +(_isotopeNames >> "(" >> double_ >> ")") >> "}" > eps(validate_weight_total(_val));
        _isotopesMixtureToken = (_elementSymbols >> _mixtureToken) [ _pass=build_isotopes_mixture(_val, _1, _2) ];

        _pureIsotopeToken     = _isotopeNames [ _pass=build_pure_isotope(_val, _1) ];
        _naturalElementToken  = _elementSymbols [ _pass=build_natural_element(_val, _1) ];

        _start = +( 
                ( (_isotopesMixtureToken | _pureIsotopeToken | _naturalElementToken) >>
                  (double_|attr(1.0)) ) [ _pass=update_element(_val, _1, _2) ] 
            );
    }

  private:
    //! Defines the rule for matching a prefix
    qi::symbols<char, std::string> _isotopeNames;
    qi::symbols<char, std::string> _elementSymbols;

    qi::rule<Iterator, isotopesMixture()> _mixtureToken;
    qi::rule<Iterator, isotopesMixture()> _isotopesMixtureToken;
    qi::rule<Iterator, isotopesMixture()> _pureIsotopeToken;
    qi::rule<Iterator, isotopesMixture()> _naturalElementToken;
    qi::rule<Iterator, isotopesMixture()> _start;
};

int main() {
    using It = std::string::const_iterator;
    ChemicalFormulaParser<It> parser;
    for (std::string const input : {
            "C",                        // --> natural carbon made of C[12] and C[13] in natural abundance
            "CH4",                      // --> methane made of natural carbon and hydrogen
            "C2H{H[1](0.8)H[2](0.2)}6", // --> ethane made of natural C and non-natural H made of 80% of hydrogen and 20% of deuterium
            "C2H{H[1](0.9)H[2](0.2)}6", // --> invalid mixture (total is 110%?)
            "U[235]",                   // --> pure uranium 235
        }) try 
    {
        std::cout << " ============= '" << input << "' ===========\n";
        It f = input.begin(), l = input.end();
        isotopesMixture mixture;
        bool ok = qi::parse(f, l, parser, mixture);

        if (ok)
            std::cout << "Parsed successfully\n";
        else
            std::cout << "Parse failure\n";

        if (f != l)
            std::cout << "Remaining input unparsed: '" << std::string(f, l) << "'\n";
    } catch(std::exception const& e) {
        std::cout << "Caught exception '" << e.what() << "'\n";
    }
}

Prints

 ============= 'C' ===========
Parsed successfully
 ============= 'CH4' ===========
Parsed successfully
 ============= 'C2H{H[1](0.8)H[2](0.2)}6' ===========
Parsed successfully
 ============= 'C2H{H[1](0.9)H[2](0.2)}6' ===========
Caught exception 'boost::spirit::qi::expectation_failure'
 ============= 'U[235]' ===========
Parsed successfully
