Chemical formula parser C++

I am currently working on a program that can parse a chemical formula and return molecular weight and percent composition. The following code works very well with compounds such as H2O, LiOH, CaCO3, and even C12H22O11. However, it is not capable of understanding compounds with polyatomic ions that lie within parentheses, such as (NH4)2SO4.

I am not necessarily looking for someone to write the program for me, just a few tips on how I might accomplish such a task.

Currently, the program iterates through the input string, raw_molecule, first finding each element's atomic number and storing it in a vector (I use a map<string, int> to store names and atomic numbers). It then finds the quantities of each element.

bool Compound::parseString() {
map<string,int>::const_iterator search;
string s_temp;
int i_temp;

for (int i=0; i<=raw_molecule.length(); i++) {
    if ((isupper(raw_molecule[i]))&&(i==0))
        s_temp=raw_molecule[i];
    else if(isupper(raw_molecule[i])&&(i!=0)) {
        // New element- so, convert s_temp to atomic # then store in v_Elements
        search=ATOMIC_NUMBER.find (s_temp);
        if (search==ATOMIC_NUMBER.end()) 
            return false;// There is a problem
        else
            v_Elements.push_back(search->second); // Add atomic number into vector

        s_temp=raw_molecule[i]; // Replace temp with the new element

    }
    else if(islower(raw_molecule[i]))
        s_temp+=raw_molecule[i]; // E.g. N+=a which means temp=="Na"
    else
        continue; // It is a number/parentheses or something
}
// Whatever's in temp must be converted to atomic number and stored in vector
search=ATOMIC_NUMBER.find (s_temp);
if (search==ATOMIC_NUMBER.end()) 
    return false;// There is a problem
else
    v_Elements.push_back(search->second); // Add atomic number into vector

// --- Find quantities next --- // 
for (int i=0; i<=raw_molecule.length(); i++) {
    if (isdigit(raw_molecule[i])) {
        if (toInt(raw_molecule[i])==0)
            return false;
        else if (isdigit(raw_molecule[i+1])) {
            if (isdigit(raw_molecule[i+2])) {
                i_temp=(toInt(raw_molecule[i])*100)+(toInt(raw_molecule[i+1])*10)+toInt(raw_molecule[i+2]);
                v_Quantities.push_back(i_temp);
            }
            else {
                i_temp=(toInt(raw_molecule[i])*10)+toInt(raw_molecule[i+1]);
                v_Quantities.push_back(i_temp);
            }

        }
        else if(!isdigit(raw_molecule[i-1])) { // Look back to make sure the digit is not part of a larger number
            v_Quantities.push_back(toInt(raw_molecule[i])); // This will not work for polyatomic ions
        }
    }
    else if(i<(raw_molecule.length()-1)) {
        if (isupper(raw_molecule[i+1])) {
            v_Quantities.push_back(1);
        }
    }
    // If there is no number, there is only 1 atom. Between O and N for example: O is upper, N is upper, O has 1.
    else if(i==(raw_molecule.length()-1)) {
        if (isalpha(raw_molecule[i]))
            v_Quantities.push_back(1);
    }
}

return true;
}

This is my first post, so if I have included too little (or maybe too much) information, please forgive me.

While you might be able to do an ad-hoc scanner-like thing that can handle one level of parens, the canonical technique for things like this is to write a real parser.

And there are two common ways to do that...

  1. Recursive descent
  2. Machine-generated bottom-up parser based on a grammar-specification file.

(And technically, there is a third category, PEG, that is machine-generated top-down.)

Anyway, for case 1, you need to code a recursive call to your parser when you see a ( token, and then return from this level of recursion on the ) token.

Typically a tree-like internal representation is created; this is called a syntax tree, but in your case you can probably skip that and just return the atomic weight from the recursive call, adding it to the running total of the level that made the call.
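As a rough illustration of case 1 without a syntax tree, the sketch below returns a weight straight from each recursive call. The names parseWeight, readCount, molecularWeight and kAtomicWeight are made up for this example, and the weight table is only a small illustrative subset, so treat it as one possible shape rather than a finished implementation.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Assumption: a tiny subset of standard atomic weights, for illustration only.
static const std::map<std::string, double> kAtomicWeight = {
    {"H", 1.008}, {"C", 12.011}, {"N", 14.007}, {"O", 15.999}, {"S", 32.06}};

// Reads an optional run of digits at s[pos]; returns 1 when there are none.
static int readCount(const std::string& s, std::size_t& pos) {
    if (pos >= s.size() || !std::isdigit(static_cast<unsigned char>(s[pos])))
        return 1;
    int n = 0;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos])))
        n = n * 10 + (s[pos++] - '0');
    return n;
}

// Sums the weight of everything up to the end of input or a ')' that this
// call did not open. That ')' is left for the caller to consume.
double parseWeight(const std::string& s, std::size_t& pos) {
    double total = 0.0;
    while (pos < s.size() && s[pos] != ')') {
        double unit = 0.0;
        if (s[pos] == '(') {
            ++pos;                       // consume '('
            unit = parseWeight(s, pos);  // recursive call for the inner group
            if (pos >= s.size() || s[pos] != ')')
                throw std::runtime_error("expected ')'");
            ++pos;                       // consume ')'
        } else if (std::isupper(static_cast<unsigned char>(s[pos]))) {
            std::string symbol(1, s[pos++]);
            while (pos < s.size() && std::islower(static_cast<unsigned char>(s[pos])))
                symbol += s[pos++];
            auto it = kAtomicWeight.find(symbol);
            if (it == kAtomicWeight.end())
                throw std::runtime_error("unknown element: " + symbol);
            unit = it->second;
        } else {
            throw std::runtime_error("unexpected character");
        }
        total += unit * readCount(s, pos);  // the number applies to atom or group
    }
    return total;
}

// Parses a whole formula and rejects a stray ')'.
double molecularWeight(const std::string& formula) {
    std::size_t pos = 0;
    double w = parseWeight(formula, pos);
    if (pos != formula.size()) throw std::runtime_error("unmatched ')'");
    assert(pos == formula.size());
    return w;
}
```

Calling molecularWeight("(NH4)2SO4") walks into the parenthesised group once, multiplies what comes back by 2, and then continues with S and O4.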

For case 2, you need to use a tool like yacc to turn a grammar into a parser.

Your parser understands certain things. It knows that when it sees N, this means "Atom of Nitrogen type". When it sees O, it means "Atom of Oxygen type".

This is very similar to the concept of identifiers in C++. When the compiler sees int someNumber = 5;, it says, "there exists a variable named someNumber of int type, into which the number 5 is stored". If you later use the name someNumber, it knows that you're talking about that someNumber (as long as you're in the right scope).

Back to your atomic parser. When your parser sees an atom followed by a number, it knows to apply that number to that atom. So O2 means "2 Atoms of Oxygen type". N2 means "2 Atoms of Nitrogen type."

This means something for your parser. It means that seeing an atom isn't sufficient. It's a good start, but it is not enough to know how many of that atom exist in the molecule. The parser needs to read the next thing. So if it sees O followed by N, it knows that the O means "1 Atom of Oxygen type". If it sees O followed by nothing (the end of the input), then it again means "1 Atom of Oxygen type".

That's what you have currently. But it's wrong, because numbers don't always modify atoms; sometimes, they modify groups of atoms, as in (NH4)2SO4.

So now, you need to change how your parser works. When it sees O, it needs to know that this is not "Atom of Oxygen type". It is a "Group containing Oxygen". O2 is "2 Groups containing Oxygen".

A group can contain one or more atoms. So when you see (, you know that you're creating a group. Therefore, when you see (...)3, you see "3 Groups containing ...".

So, what is (NH4)2? It is "2 Groups containing [1 Group containing Nitrogen followed by 4 Groups containing Hydrogen]".

The key to doing this is understanding what I just wrote. Groups can contain other groups. There is nesting in groups. How do you implement nesting?

Well, your parser looks something like this currently:

NumericAtom ParseAtom(input)
{
  Atom = ReadAtom(input); //Gets the atom and removes it from the current input.
  if(IsNumber(input)) //Returns true if the input is looking at a number.
  {
    int Count = ReadNumber(input); //Gets the number and removes it from the current input.
    return NumericAtom(Atom, Count);
  }

  return NumericAtom(Atom, 1);
}

vector<NumericAtom> Parse(input)
{
  vector<NumericAtom> molecule;
  while(IsAtom(input))
    molecule.push_back(ParseAtom(input));
  return molecule;
}

Your code calls ParseAtom() until the input runs dry, storing each atom+count in an array. Obviously you have some error-checking in there, but let's ignore that for now.
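For readers who want to compile the sketch above, here is one possible concrete rendering. The Input cursor type and the bodies of IsAtom, IsNumber, ReadAtom and ReadNumber are assumptions filled in for illustration; the pseudocode leaves them unspecified.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct NumericAtom {
    std::string symbol;  // e.g. "Na"
    int count;           // e.g. 2 for "Na2"
};

// Hypothetical cursor type standing in for the untyped `input`.
struct Input {
    std::string text;
    std::size_t pos;
    explicit Input(const std::string& t) : text(t), pos(0) {}
    bool atEnd() const { return pos >= text.size(); }
    char peek() const { return atEnd() ? '\0' : text[pos]; }
};

bool IsAtom(const Input& in) {
    return std::isupper(static_cast<unsigned char>(in.peek())) != 0;
}

bool IsNumber(const Input& in) {
    return std::isdigit(static_cast<unsigned char>(in.peek())) != 0;
}

// Reads one uppercase letter plus any lowercase letters that follow it.
std::string ReadAtom(Input& in) {
    assert(IsAtom(in));  // callers check IsAtom first
    std::string symbol(1, in.text[in.pos++]);
    while (std::islower(static_cast<unsigned char>(in.peek())))
        symbol += in.text[in.pos++];
    return symbol;
}

// Reads a run of digits as a decimal number.
int ReadNumber(Input& in) {
    int n = 0;
    while (IsNumber(in)) n = n * 10 + (in.text[in.pos++] - '0');
    return n;
}

NumericAtom ParseAtom(Input& in) {
    std::string atom = ReadAtom(in);
    int count = IsNumber(in) ? ReadNumber(in) : 1;  // no number means one atom
    return NumericAtom{atom, count};
}

std::vector<NumericAtom> Parse(Input& in) {
    std::vector<NumericAtom> molecule;
    while (IsAtom(in)) molecule.push_back(ParseAtom(in));
    return molecule;
}
```

Given an Input holding C12H22O11, Parse returns three entries. Given (NH4)2SO4 it returns nothing at all, because the first character is not an atom, which is exactly the limitation discussed here.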

What you need to do is stop parsing atoms. You need to parse groups, which are either a single atom, or a group of atoms denoted by () pairs.

Group ParseGroup(input)
{
    Group myGroup; //Empty group

    if(IsLeftParen(input)) //Are we looking at a `(` character?
    {
        EatLeftParen(input); //Removes the `(` from the input.

        myGroup.SetSequence(ParseGroupSequence(input)); //RECURSIVE CALL!!!

        if(!IsRightParen(input)) //Groups started by `(` must end with `)`
            throw ParseError("Inner groups must end with `)`.");
        else
            EatRightParen(input); //Remove the `)` from the input.
    }
    else if(IsAtom(input))
    {
        myGroup.SetAtom(ReadAtom(input)); //Group contains one atom.
    }
    else
        throw ParseError("Unexpected input."); //error

    //Read the number.
    if(IsNumber(input))
        myGroup.SetCount(ReadNumber(input));
    else
        myGroup.SetCount(1);

    return myGroup;
}

vector<Group> ParseGroupSequence(input)
{
    vector<Group> groups;

    //Groups continue until the end of input or `)` is reached.
    while(!IsRightParen(input) and !IsEndOfInput(input)) 
        groups.push_back(ParseGroup(input));

    return groups;
}

The big difference here is that ParseGroup (the analog to the ParseAtom function) will call ParseGroupSequence. Which will call ParseGroup. Which can call ParseGroupSequence. Etc. A Group can either contain an atom or a sequence of Groups (such as NH4), stored as a vector<Group>.

When functions can call themselves (either directly or indirectly), it is called recursion. Which is fine, so long as it doesn't recurse infinitely. And there's no chance of that, because it only recurses when it sees a (, and each ( consumes one character of a finite input.
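A compilable take on the two mutually recursive functions might look like the following. It assumes the input is a std::string plus a position passed by reference, and a Group struct with public members instead of the SetAtom/SetSequence/SetCount setters; both are choices made for this sketch. A struct holding a std::vector of its own type is guaranteed to work from C++17 and is accepted by the major standard libraries in earlier modes.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct Group {
    std::string atom;             // non-empty when the group is a single atom
    std::vector<Group> sequence;  // filled when the group came from "( ... )"
    int count = 1;
};

static bool digitAt(const std::string& s, std::size_t pos) {
    return pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]));
}

std::vector<Group> ParseGroupSequence(const std::string& s, std::size_t& pos);

Group ParseGroup(const std::string& s, std::size_t& pos) {
    Group g;
    if (pos < s.size() && s[pos] == '(') {
        ++pos;                                    // eat '('
        g.sequence = ParseGroupSequence(s, pos);  // recursive call
        if (pos >= s.size() || s[pos] != ')')
            throw std::runtime_error("Inner groups must end with ')'.");
        ++pos;                                    // eat ')'
    } else if (pos < s.size() && std::isupper(static_cast<unsigned char>(s[pos]))) {
        g.atom += s[pos++];
        while (pos < s.size() && std::islower(static_cast<unsigned char>(s[pos])))
            g.atom += s[pos++];
    } else {
        throw std::runtime_error("Unexpected input.");
    }
    if (digitAt(s, pos)) {  // optional count; stays 1 when absent
        g.count = 0;
        while (digitAt(s, pos)) g.count = g.count * 10 + (s[pos++] - '0');
    }
    return g;
}

// Groups continue until the end of input or a ')' is reached.
std::vector<Group> ParseGroupSequence(const std::string& s, std::size_t& pos) {
    std::vector<Group> groups;
    while (pos < s.size() && s[pos] != ')')
        groups.push_back(ParseGroup(s, pos));
    assert(pos <= s.size());
    return groups;
}
```

As in the pseudocode, a top-level call simply stops at a stray ), so a caller that wants to reject input such as H2)O should check that the position reached the end of the string.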

So how does this work? Well, let's consider some possible inputs:

NH3

  1. ParseGroupSequence is called. It isn't at the end of input or ), so it calls ParseGroup.
    1. ParseGroup sees an N, which is an atom. It adds this atom to the Group. It then sees an H, which is not a number. So it sets the Group's count to 1, then returns the Group.
  2. Back in ParseGroupSequence, we store the returned Group in the sequence, then iterate in our loop. We don't see the end of input or ), so it calls ParseGroup:
    1. ParseGroup sees an H, which is an atom. It adds this atom to the Group. It then sees a 3, which is a number. So it reads this number, sets it as the Group's count, and returns the Group.
  3. Back in ParseGroupSequence, we store the returned Group in the sequence, then iterate in our loop. We don't see ), but we do see the end of input. So we return the current vector<Group>.

(NH3)2

  1. ParseGroupSequence is called. It isn't at the end of input or ), so it calls ParseGroup.
    1. ParseGroup sees a (, which is the start of a Group. It eats this character (removing it from the input) and calls ParseGroupSequence on the Group.
      1. ParseGroupSequence isn't at the end of input or ), so it calls ParseGroup.
        1. ParseGroup sees an N, which is an atom. It adds this atom to the Group. It then sees an H, which is not a number. So it sets the Group's count to 1, then returns the Group.
      2. Back in ParseGroupSequence, we store the returned Group in the sequence, then iterate in our loop. We don't see the end of input or ), so it calls ParseGroup:
        1. ParseGroup sees an H, which is an atom. It adds this atom to the Group. It then sees a 3, which is a number. So it reads this number, sets it as the Group's count, and returns the Group.
      3. Back in ParseGroupSequence, we store the returned Group in the sequence, then iterate in our loop. We don't see the end of input, but we do see ). So we return the current vector<Group>.
    2. Back in the first call to ParseGroup, we get the vector<Group> back. We stick it into our current Group as a sequence. We check that the next character is ), eat it, and continue. We see a 2, which is a number. So it reads this number, sets it as the Group's count, and returns the Group.
  2. Now, way, way back at the original ParseGroupSequence call, we store the returned Group in the sequence, then iterate in our loop. We don't see ), but we do see the end of input. So we return the current vector<Group>.

This parser uses recursion to "descend" into each group. Therefore, this kind of parser is called a "recursive descent parser" (there's a formal definition for this kind of thing, but this is a good lay understanding of the concept).
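Once the nested Groups exist, getting per-element totals is a second recursive walk that multiplies counts on the way down. The sketch below is an assumption about how that step could look; the CountAtoms name and the plain-struct Group are invented here, and the tree for (NH3)2 is built by hand in the usage note so the block stands alone.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Group {
    std::string atom;             // non-empty for a single-atom group
    std::vector<Group> sequence;  // non-empty for a parenthesised group
    int count = 1;
};

// Adds every atom under `groups` to `totals`, scaled by the counts of all
// enclosing groups (carried in `multiplier`).
void CountAtoms(const std::vector<Group>& groups, int multiplier,
                std::map<std::string, int>& totals) {
    assert(multiplier >= 1);
    for (const Group& g : groups) {
        int scaled = multiplier * g.count;
        if (!g.atom.empty())
            totals[g.atom] += scaled;               // leaf: one kind of atom
        else
            CountAtoms(g.sequence, scaled, totals);  // descend into the group
    }
}
```

For (NH3)2 the top-level vector holds one Group with count 2 whose sequence is N (count 1) and H (count 3). Calling CountAtoms on it with a multiplier of 1 yields N: 2 and H: 6, which is all that molecular weight and percent composition need.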

It is often helpful to write down the rules of the grammar for the strings you want to read and recognise. A grammar is just a bunch of rules which say what sequences of characters are acceptable, and by implication which are not. It helps to have the grammar before and while writing the program, and it might be fed into a parser generator (as described by DigitalRoss).

For example, the rules for a simple compound, without polyatomic ions, look like:

Compound:  Component { Component };
Component: Atom [Quantity] 
Atom: 'H' | 'He' | 'Li' | 'Be' ...
Quantity: Digit { Digit }
Digit: '0' | '1' | ... '9'
  • [...] is read as optional, and will be an if test in the program (either it is there or missing)
  • | separates alternatives, and so becomes an if .. else if .. else or switch test; it says the input must match one of these
  • { ... } is read as repetition of 0 or more, and will be a while loop in the program
  • Characters between quotes are literal characters which will be in the string. All the other words are names of rules, and for a recursive descent parser, end up being the names of the functions which get called to chop up and handle the input.

For example, the function that implements the 'Quantity' rule just needs to read one or more digit characters and convert them to an integer. The function that implements the Atom rule reads enough characters to figure out which atom it is, and stores that away.
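A minimal sketch of the Quantity rule, plus the optional [Quantity] wrapper from the Component rule, could look like this. The function names and the (string, position) parameters are assumptions for illustration, and overflow on very long digit runs is not handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

static bool digitAt(const std::string& s, std::size_t pos) {
    return pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]));
}

// Quantity: Digit { Digit }
// The first Digit is mandatory; the { Digit } repetition becomes the while loop.
int parseQuantity(const std::string& s, std::size_t& pos) {
    if (!digitAt(s, pos))
        throw std::runtime_error("Expecting a digit");
    int value = 0;
    while (digitAt(s, pos))
        value = value * 10 + (s[pos++] - '0');
    return value;
}

// [Quantity]
// The optional brackets become an if test; a missing quantity means 1.
int parseOptionalQuantity(const std::string& s, std::size_t& pos) {
    assert(pos <= s.size());
    if (digitAt(s, pos))
        return parseQuantity(s, pos);
    return 1;
}
```

With the string H12O and a position of 1, parseQuantity returns 12 and leaves the position at 3, ready for the next Component.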

A nice thing about recursive descent parsers is that the error messages can be quite helpful, of the form "Expecting an Atom name, but got %c" or "Expecting a ')' but reached the end of the string". It is a bit complicated to recover after an error, so you might want to throw an exception at the first error.
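One way to get messages of that form is a small helper that either consumes the expected character or throws. ParseError and expectChar below are hypothetical names for this sketch; the position member is an extra that lets a caller point at the offending character.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical exception type that also records where parsing failed.
class ParseError : public std::runtime_error {
public:
    ParseError(const std::string& msg, std::size_t position)
        : std::runtime_error(msg), position_(position) {}
    std::size_t position() const { return position_; }
private:
    std::size_t position_;
};

// Consumes `expected` at s[pos], or throws a message saying what was found.
void expectChar(const std::string& s, std::size_t& pos, char expected) {
    assert(pos <= s.size());
    if (pos == s.size())
        throw ParseError(std::string("Expecting '") + expected +
                         "' but reached the end of the string", pos);
    if (s[pos] != expected)
        throw ParseError(std::string("Expecting '") + expected +
                         "' but got '" + s[pos] + "'", pos);
    ++pos;
}
```

Because the helper throws at the first mismatch, the parser never has to resynchronise; the caller catches ParseError once at the top level and reports what() together with position().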

So are polyatomic ions just one level of parentheses? If so, the grammar might be:

Compound: Component { Component }  
Component: Atom [Quantity] | '(' Component { Component } ')' [Quantity];
Atom: 'H' | 'He' | 'Li' ...
Quantity: Digit { Digit }
Digit: '0' | '1' | ... '9'

Or is it more complex, so that the notation must allow for nested parentheses? Once that is clear, you can figure out an approach to parsing.

I do not know the entire scope of your problem, but recursive descent parsers are relatively straightforward to write, and look adequate for your problem.

Consider re-structuring your program as a simple Recursive Descent Parser.

First, you need to change the parseString function to take a string to be parsed, and the current position from which to start the parse, passed by reference.

This way you can structure your code so that when you see a ( you call the same function at the next position, get a Composite back, and consume the closing ). When you see a ) by itself, you return without consuming it. This lets you consume formulas with unlimited nesting of ( and ), although I am not sure if it is necessary (it's been more than 20 years since the last time I saw a chemical formula).

This way you'd write the code for parsing a composite only once, and re-use it as many times as needed. It will be easy to extend your reader to consume formulas with dashes etc., because your parser will need to deal only with the basic building blocks.
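A sketch of that restructuring, under the assumption that a Composite is simply a map from element symbol to atom count (the answer does not define the type, so the typedef and the parseComposite name are invented here):

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Assumption: a "Composite" is element symbol -> number of atoms.
typedef std::map<std::string, int> Composite;

static bool digitAt(const std::string& s, std::size_t pos) {
    return pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]));
}

// Parses from `pos` until the end of input or a ')' that this call did not
// open. The position is passed by reference so callers see how far we got.
Composite parseComposite(const std::string& s, std::size_t& pos) {
    Composite result;
    while (pos < s.size() && s[pos] != ')') {
        Composite part;
        if (s[pos] == '(') {
            ++pos;                          // step past '('
            part = parseComposite(s, pos);  // same function, next position
            if (pos >= s.size() || s[pos] != ')')
                throw std::runtime_error("missing ')'");
            ++pos;                          // the opener consumes the ')'
        } else if (std::isupper(static_cast<unsigned char>(s[pos]))) {
            std::string symbol(1, s[pos++]);
            while (pos < s.size() && std::islower(static_cast<unsigned char>(s[pos])))
                symbol += s[pos++];
            part[symbol] = 1;
        } else {
            throw std::runtime_error("unexpected character");
        }
        int count = 1;                      // a missing number means 1
        if (digitAt(s, pos)) {
            count = 0;
            while (digitAt(s, pos)) count = count * 10 + (s[pos++] - '0');
        }
        // Merge the atom or group into this level, scaled by its count.
        for (Composite::const_iterator it = part.begin(); it != part.end(); ++it)
            result[it->first] += it->second * count;
    }
    assert(pos <= s.size());
    return result;
}
```

The same function handles every nesting depth, so (NH4)2SO4 and a doubly nested formula go through identical code. A top-level caller should still verify that the position equals the string length afterwards, since a stray ) makes the loop stop early.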

Maybe you can get rid of brackets before parsing. You need to find how many "brackets in brackets" (sorry for my English) there are and rewrite the formula like this, beginning with the "deepest":

  1. (NH4(Na2H4)3Zn)2SO4 (this formula doesn't mean anything, actually...)

  2. (NH4Na6H12Zn)2SO4

  3. N2H8Na12H24Zn2SO4

  4. No brackets left, so let's run your code with N2H8Na12H24Zn2SO4
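The rewriting steps above can be sketched as a loop that repeatedly expands the innermost pair of brackets. The function names are invented for this example, and it assumes the text inside an innermost pair consists only of element symbols with optional counts.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

static bool digitAt(const std::string& s, std::size_t pos) {
    return pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]));
}

// Multiplies every count in a bracket-free fragment: ("Na2H4", 3) -> "Na6H12".
static std::string scaleFragment(const std::string& frag, int factor) {
    std::string out;
    std::size_t i = 0;
    while (i < frag.size()) {
        if (!std::isupper(static_cast<unsigned char>(frag[i])))
            throw std::runtime_error("expected an element symbol");
        out += frag[i++];
        while (i < frag.size() && std::islower(static_cast<unsigned char>(frag[i])))
            out += frag[i++];
        int count = 1;
        if (digitAt(frag, i)) {
            count = 0;
            while (digitAt(frag, i)) count = count * 10 + (frag[i++] - '0');
        }
        if (count * factor != 1) out += std::to_string(count * factor);
    }
    return out;
}

// Expands brackets from the deepest level outwards until none are left.
std::string expandBrackets(std::string formula) {
    for (;;) {
        // The first ')' and the last '(' before it enclose an innermost group.
        std::size_t close = formula.find(')');
        if (close == std::string::npos) break;
        std::size_t open = formula.rfind('(', close);
        if (open == std::string::npos) throw std::runtime_error("unmatched ')'");
        std::size_t end = close + 1;
        int factor = 1;  // multiplier after ')', defaulting to 1
        if (digitAt(formula, end)) {
            factor = 0;
            while (digitAt(formula, end)) factor = factor * 10 + (formula[end++] - '0');
        }
        assert(open < close);
        std::string inner = formula.substr(open + 1, close - open - 1);
        formula.replace(open, end - open, scaleFragment(inner, factor));
    }
    if (formula.find('(') != std::string::npos)
        throw std::runtime_error("unmatched '('");
    return formula;
}
```

For the example formula this produces (NH4Na6H12Zn)2SO4 after the first pass and N2H8Na12H24Zn2SO4 after the second. Note that the flattened string can mention the same element more than once (H8 and H24 here), so whatever consumes it has to add those entries together.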
