Chemical formula parser C++

Question

I am currently working on a program that can parse a chemical formula and return molecular weight and percent composition. The following code works very well with compounds such as H₂O, LiOH, CaCO₃, and even C₁₂H₂₂O₁₁. However, it is not capable of understanding compounds with polyatomic ions that lie within parentheses, such as (NH₄)₂SO₄.

I am not looking for someone to necessarily write the program for me, but just give me a few tips on how I might accomplish such a task.

Currently, the program iterates through the input string, raw_molecule, first finding each element's atomic number and storing it in a vector (I use a map<string, int> to map element names to atomic numbers). It then finds the quantities of each element.

bool Compound::parseString() {
map<string,int>::const_iterator search;
string s_temp;
int i_temp;

for (int i=0; i<=raw_molecule.length(); i++) {
    if ((isupper(raw_molecule[i]))&&(i==0))
        s_temp=raw_molecule[i];
    else if(isupper(raw_molecule[i])&&(i!=0)) {
        // New element- so, convert s_temp to atomic # then store in v_Elements
        search=ATOMIC_NUMBER.find (s_temp);
        if (search==ATOMIC_NUMBER.end()) 
            return false;// There is a problem
        else
            v_Elements.push_back(search->second); // Add atomic number into vector

        s_temp=raw_molecule[i]; // Replace temp with the new element

    }
    else if(islower(raw_molecule[i]))
        s_temp+=raw_molecule[i]; // E.g. N+=a which means temp=="Na"
    else
        continue; // It is a number/parentheses or something
}
// Whatever's in temp must be converted to atomic number and stored in vector
search=ATOMIC_NUMBER.find (s_temp);
if (search==ATOMIC_NUMBER.end()) 
    return false;// There is a problem
else
    v_Elements.push_back(search->second); // Add atomic number into vector

// --- Find quantities next --- // 
for (int i=0; i<=raw_molecule.length(); i++) {
    if (isdigit(raw_molecule[i])) {
        if (toInt(raw_molecule[i])==0)
            return false;
        else if (isdigit(raw_molecule[i+1])) {
            if (isdigit(raw_molecule[i+2])) {
                i_temp=(toInt(raw_molecule[i])*100)+(toInt(raw_molecule[i+1])*10)+toInt(raw_molecule[i+2]);
                v_Quantities.push_back(i_temp);
            }
            else {
                i_temp=(toInt(raw_molecule[i])*10)+toInt(raw_molecule[i+1]);
                v_Quantities.push_back(i_temp);
            }

        }
        else if(!isdigit(raw_molecule[i-1])) { // Look back to make sure the digit is not part of a larger number
            v_Quantities.push_back(toInt(raw_molecule[i])); // This will not work for polyatomic ions
        }
    }
    else if(i<(raw_molecule.length()-1)) {
        if (isupper(raw_molecule[i+1])) {
            v_Quantities.push_back(1);
        }
    }
    // If there is no number, there is only 1 atom. Between O and N for example: O is upper, N is upper, O has 1.
    else if(i==(raw_molecule.length()-1)) {
        if (isalpha(raw_molecule[i]))
            v_Quantities.push_back(1);
    }
}

return true;
}

This is my first post, so if I have included too little (or maybe too much) information, please forgive me.

Answer 1

While you might be able to do an ad-hoc scanner-like thing that can handle one level of parens, the canonical technique used for things like this is to write a real parser.

And there are two common ways to do that...

1. Recursive descent
2. Machine-generated bottom-up parser based on a grammar-specification file.

(And technically, there is a third category, PEG, that is machine-generated-top-down.)

Anyway, for case 1, you need to code a recursive call to your parser when you see a ( and then return from this level of recursion on the ) token.

Typically a tree-like internal representation is created; this is called a syntax tree, but in your case, you can probably skip that and just return the atomic weight from the recursive call, adding it to the running total at the level that made the call.

For case 2, you need to use a tool like yacc to turn a grammar into a parser.

Answer 2

Your parser understands certain things. It knows that when it sees N, this means "Atom of Nitrogen type". When it sees O, it means "Atom of Oxygen type".

This is very similar to the concept of identifiers in C++. When the compiler sees int someNumber = 5; it says, "there exists a variable named someNumber of int type, into which the number 5 is stored". If you later use the name someNumber, it knows that you're talking about that someNumber (as long as you're in the right scope).

Back to your atomic parser. When your parser sees an atom followed by a number, it knows to apply that number to that atom. So O2 means "2 Atoms of Oxygen type". N2 means "2 Atoms of Nitrogen type."

This means something for your parser. It means that seeing an atom isn't sufficient. It's a good start, but it is not sufficient to know how many of that atom exist in the molecule. It needs to read the next thing. So if it sees O followed by N, it knows that the O means "1 Atom of Oxygen type". If it sees O followed by nothing (the end of the input), then it again means "1 Atom of Oxygen type".

That's what you have currently. But it's wrong. Because numbers don't always modify atoms; sometimes, they modify groups of atoms. As in (NH4)2SO4.

So now, you need to change how your parser works. When it sees O, it needs to know that this is not "Atom of Oxygen type". It is a "Group containing Oxygen". O2 is "2 Groups containing Oxygen".

A group can contain one or more atoms. So when you see (, you know that you're creating a group. Therefore, when you see (...)3, you see "3 Groups containing ...".

So, what is (NH4)2? It is "2 Groups containing [1 Group containing Nitrogen followed by 4 Groups containing Hydrogen]".

The key to doing this is understanding what I just wrote. Groups can contain other groups. There is nesting in groups. How do you implement nesting?

Well, your parser looks something like this currently:

NumericAtom ParseAtom(input)
{
  Atom = ReadAtom(input); //Gets the atom and removes it from the current input.
  if(IsNumber(input)) //Returns true if the input is looking at a number.
  {
    int Count = ReadNumber(input); //Gets the number and removes it from the current input.
    return NumericAtom(Atom, Count);
  }

  return NumericAtom(Atom, 1);
}

vector<NumericAtom> Parse(input)
{
  vector<NumericAtom> molecule;
  while(IsAtom(input))
    molecule.push_back(ParseAtom(input));
  return molecule;
}

Your code calls ParseAtom() until the input runs dry, storing each atom+count in an array. Obviously you have some error-checking in there, but let's ignore that for now.

What you need to do is stop parsing atoms. You need to parse groups, which are either a single atom, or a group of atoms denoted by () pairs.

Group ParseGroup(input)
{
    Group myGroup; //Empty group

    if(IsLeftParen(input)) //Are we looking at a `(` character?
    {
        EatLeftParen(input); //Removes the `(` from the input.

        myGroup.SetSequence(ParseGroupSequence(input)); //RECURSIVE CALL!!!

        if(!IsRightParen(input)) //Groups started by `(` must end with `)`
            throw ParseError("Inner groups must end with `)`.");
        else
            EatRightParen(input); //Remove the `)` from the input.
    }
    else if(IsAtom(input))
    {
        myGroup.SetAtom(ReadAtom(input)); //Group contains one atom.
    }
    else
        throw ParseError("Unexpected input."); //error

    //Read the number.
    if(IsNumber(input))
        myGroup.SetCount(ReadNumber(input));
    else
        myGroup.SetCount(1);

    return myGroup;
}

vector<Group> ParseGroupSequence(input)
{
    vector<Group> groups;

    //Groups continue until the end of input or `)` is reached.
    while(!IsRightParen(input) and !IsEndOfInput(input)) 
        groups.push_back(ParseGroup(input));

    return groups;
}

The big difference here is that ParseGroup (the analog to the ParseAtom function) will call ParseGroupSequence, which will call ParseGroup, which can call ParseGroupSequence, and so on. A Group can contain either an atom or a sequence of Groups (such as NH4), stored as a vector<Group>.

When functions can call themselves (either directly or indirectly), it is called recursion. Which is fine, so long as it doesn't recurse infinitely. And there's no chance of that, because it only recurses when it sees a (, and each ( it consumes shortens the remaining input.

So how does this work? Well, let's consider some possible inputs:

NH3

ParseGroupSequence is called. It isn't at the end of input or ), so it calls ParseGroup.
1. ParseGroup sees an N, which is an atom. It adds this atom to the Group. It then sees an H, which is not a number. So it sets the Group's count to 1, then returns the Group.
Back in ParseGroupSequence, we store the returned Group in the sequence, then iterate in our loop. We don't see the end of input or ), so it calls ParseGroup:
1. ParseGroup sees an H, which is an atom. It adds this atom to the Group. It then sees a 3, which is a number. So it reads this number, sets it as the Group's count, and returns the Group.
Back in ParseGroupSequence, we store the returned Group in the sequence, then iterate in our loop. We don't see ), but we do see the end of input. So we return the current vector<Group>.

(NH3)2

ParseGroupSequence is called. It isn't at the end of input or ), so it calls ParseGroup.
1. ParseGroup sees a (, which is the start of a Group. It eats this character (removing it from the input) and calls ParseGroupSequence on the Group.
  1. ParseGroupSequence isn't at the end of input or ), so it calls ParseGroup.
    1. ParseGroup sees an N, which is an atom. It adds this atom to the Group. It then sees an H, which is not a number. So it sets the Group's count to 1, then returns the Group.
  2. Back in ParseGroupSequence, we store the returned Group in the sequence, then iterate in our loop. We don't see the end of input or ), so it calls ParseGroup:
    1. ParseGroup sees an H, which is an atom. It adds this atom to the Group. It then sees a 3, which is a number. So it reads this number, sets it as the Group's count, and returns the Group.
  3. Back in ParseGroupSequence, we store the returned Group in the sequence, then iterate in our loop. We don't see the end of input, but we do see ). So we return the current vector<Group>.
2. Back in the first call to ParseGroup, we get the vector<Group> back. We stick it into our current Group as a sequence. We check to see if the next character is ), eat it, and continue. We see a 2, which is a number. So it reads this number, sets it as the Group's count, and returns the Group.
Now, way, way back at the original ParseGroupSequence call, we store the returned Group in the sequence, then iterate in our loop. We don't see ), but we do see the end of input. So we return the current vector<Group>.

This parser uses recursion to "descend" into each group. Therefore, this kind of parser is called a "recursive descent parser" (there's a formal definition for this kind of thing, but this is a good lay-understanding of the concept).
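For reference, here is one way the ParseGroup / ParseGroupSequence pseudocode could be fleshed out into compilable C++. The Input cursor, the layout of Group, and the Tally helper are assumptions made for this sketch; the pseudocode leaves those details open.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// A Group is either a single atom (atom non-empty) or a nested sequence of Groups.
struct Group {
    std::string atom;
    std::vector<Group> sequence;
    int count = 1;
};

// Hypothetical cursor standing in for the pseudocode's `input`.
struct Input {
    std::string text;
    std::size_t pos = 0;
    bool atEnd() const { return pos >= text.size(); }
    char peek() const { return atEnd() ? '\0' : text[pos]; }
};

std::vector<Group> ParseGroupSequence(Input& in);

Group ParseGroup(Input& in) {
    Group g;
    if (in.peek() == '(') {
        ++in.pos;                                 // eat '('
        g.sequence = ParseGroupSequence(in);      // recursive call
        if (in.peek() != ')')
            throw std::runtime_error("Inner groups must end with ')'.");
        ++in.pos;                                 // eat ')'
    } else if (std::isupper(static_cast<unsigned char>(in.peek()))) {
        g.atom += in.text[in.pos++];              // capital letter starts the symbol
        while (std::islower(static_cast<unsigned char>(in.peek())))
            g.atom += in.text[in.pos++];          // lowercase letters continue it
    } else {
        throw std::runtime_error("Unexpected input.");
    }
    if (std::isdigit(static_cast<unsigned char>(in.peek()))) {
        g.count = 0;
        while (std::isdigit(static_cast<unsigned char>(in.peek())))
            g.count = g.count * 10 + (in.text[in.pos++] - '0');
    }
    return g;
}

// Groups continue until the end of input or ')' is reached.
std::vector<Group> ParseGroupSequence(Input& in) {
    std::vector<Group> groups;
    while (!in.atEnd() && in.peek() != ')')
        groups.push_back(ParseGroup(in));
    return groups;
}

// Walks the tree and accumulates per-element totals, multiplying counts on the way down.
void Tally(const std::vector<Group>& groups, int multiplier,
           std::map<std::string, int>& totals) {
    for (const Group& g : groups) {
        if (!g.atom.empty())
            totals[g.atom] += g.count * multiplier;
        else
            Tally(g.sequence, g.count * multiplier, totals);
    }
}
```

A caller would set Input::text, call ParseGroupSequence once, and then check atEnd(): if input remains, the parse stopped at a ) that never had a matching (. This sketch does not validate symbols against a periodic table.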

Answer 3

It is often helpful to write down the rules of the grammar for the strings you want to read and recognise. A grammar is just a bunch of rules which say what sequence of characters is acceptable, and by implication which are not acceptable. It helps to have the grammar before and while writing the program, and it might also be fed into a parser generator (as described by DigitalRoss).

For example, the rules for a simple compound, without polyatomic ions, look like:

Compound:  Component { Component };
Component: Atom [Quantity] 
Atom: 'H' | 'He' | 'Li' | 'Be' ...
Quantity: Digit { Digit }
Digit: '0' | '1' | ... '9'

[...] is read as optional, and will be an if test in the program (either it is there or missing)
| is alternatives, and so is an if .. else if .. else or switch 'test', it is saying the input must match one of these
{ ... } is read as repetition of 0 or more, and will be a while loop in the program
Characters between quotes are literal characters which will be in the string. All the other words are names of rules, and for a recursive descent parser, end up being the names of the functions which get called to chop up, and handle the input.

For example, the function that implements the 'Quantity' rule just needs to read one or more digit characters and convert them to an integer. The function that implements the Atom rule reads enough characters to figure out which atom it is, and stores that away.

A nice thing about recursive descent parsers is that the error messages can be quite helpful, of the form "Expecting an Atom name, but got %c" or "Expecting a ')' but reached the end of the string". It is a bit complicated to recover after an error, so you might want to throw an exception at the first error.
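As a rough sketch of how the simple grammar (the one without parentheses) maps onto code, each rule below becomes one function. The Cursor type, the function names, and the short kSymbols table are assumptions for illustration; a real Atom rule would list every element.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Illustrative subset of element symbols (assumption, not a full table).
static const std::set<std::string> kSymbols = {
    "H", "He", "Li", "Be", "C", "N", "O", "Na", "Ca"};

// Hypothetical cursor over the input string.
struct Cursor {
    std::string text;
    std::size_t pos = 0;
    bool more() const { return pos < text.size(); }
};

static bool isDigitAt(const Cursor& c) {
    return c.more() && std::isdigit(static_cast<unsigned char>(c.text[c.pos]));
}

// Quantity: Digit { Digit }
int parseQuantity(Cursor& c) {
    if (!isDigitAt(c))
        throw std::runtime_error("Expecting a digit");
    int n = 0;
    while (isDigitAt(c))                       // { ... } becomes a while loop
        n = n * 10 + (c.text[c.pos++] - '0');
    return n;
}

// Atom: 'H' | 'He' | 'Li' | ...  (tries the two-letter symbol before the one-letter one)
std::string parseAtom(Cursor& c) {
    for (std::size_t len = 2; len >= 1; --len) {
        if (c.pos + len <= c.text.size()) {
            std::string candidate = c.text.substr(c.pos, len);
            if (kSymbols.count(candidate)) {
                c.pos += len;
                return candidate;
            }
        }
    }
    if (!c.more())
        throw std::runtime_error("Expecting an Atom name, but reached the end of the string");
    throw std::runtime_error(std::string("Expecting an Atom name, but got '") +
                             c.text[c.pos] + "'");
}

// Component: Atom [Quantity]
std::pair<std::string, int> parseComponent(Cursor& c) {
    std::string atom = parseAtom(c);
    int quantity = isDigitAt(c) ? parseQuantity(c) : 1;   // [ ... ] becomes an if test
    return {atom, quantity};
}

// Compound: Component { Component }
std::vector<std::pair<std::string, int>> parseCompound(Cursor& c) {
    std::vector<std::pair<std::string, int>> parts;
    parts.push_back(parseComponent(c));
    while (c.more())
        parts.push_back(parseComponent(c));
    return parts;
}
```

Throwing at the first error keeps the sketch short; the exception text says what was expected and what was found instead.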

So are polyatomic ions just one level of parentheses? If so, the grammar might be:

Compound: Component { Component }  
Component: Atom [Quantity] | '(' Component { Component } ')' [Quantity];
Atom: 'H' | 'He' | 'Li' ...
Quantity: Digit { Digit }
Digit: '0' | '1' | ... '9'

Or is it more complex, so that the notation must allow for nested parentheses? Once that is clear, you can figure out an approach to parsing.

I do not know the entire scope of your problem, but recursive descent parsers are relatively straightforward to write, and look adequate for your problem.

Answer 4

Consider re-structuring your program as a simple Recursive Descent Parser.

First, you need to change the parseString function to take a string to be parsed, and the current position from which to start the parse, passed by reference.

This way you can structure your code so that when you see a ( you call the same function at the next position, get a Composite back, and consume the closing ). When you see a ) by itself, you return without consuming it. This lets you consume formulas with unlimited nesting of ( and ), although I am not sure if it is necessary (it's been more than 20 years since the last time I saw a chemical formula).

This way you'd write the code for parsing composite only once, and re-use it as many times as needed. It will be easy to supplement your reader to consume formulas with dashes etc., because your parser will need to deal only with the basic building blocks.
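A minimal sketch of that shape: one function that takes the string plus the current position by reference and returns a composite. Here Composite is assumed to be a map from element symbol to atom count, and parseComposite and digitAt are made-up names; the real Compound class would presumably store atomic numbers instead.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for "Composite": element symbol -> number of atoms.
using Composite = std::map<std::string, int>;

static bool digitAt(const std::string& s, std::size_t pos) {
    return pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]));
}

// Parses from pos until the end of input or a ')' that this level did not open.
// pos is passed by reference so the caller resumes where the callee stopped.
Composite parseComposite(const std::string& s, std::size_t& pos) {
    Composite result;
    while (pos < s.size() && s[pos] != ')') {   // a bare ')' is left for the caller
        Composite part;
        if (s[pos] == '(') {
            ++pos;                              // consume '('
            part = parseComposite(s, pos);      // same function, next position
            if (pos >= s.size() || s[pos] != ')')
                throw std::runtime_error("expected ')'");
            ++pos;                              // consume the closing ')'
        } else if (std::isupper(static_cast<unsigned char>(s[pos]))) {
            std::string symbol(1, s[pos++]);
            while (pos < s.size() && std::islower(static_cast<unsigned char>(s[pos])))
                symbol += s[pos++];
            part[symbol] = 1;
        } else {
            throw std::runtime_error("unexpected character");
        }
        int count = 1;                          // no number means one
        if (digitAt(s, pos)) {
            count = 0;
            while (digitAt(s, pos))
                count = count * 10 + (s[pos++] - '0');
        }
        for (const auto& entry : part)          // merge the scaled part into this level
            result[entry.first] += entry.second * count;
    }
    return result;
}
```

The top-level caller starts with pos at 0 and should check that pos equals the string length afterwards; anything left over means a stray ). The sketch does not check symbols against a table of known elements.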

Answer 5

Maybe you can get rid of brackets before parsing. You need to find how many "brackets in brackets" (sorry for my English) there are and rewrite the formula like this, beginning with the deepest:

(NH₄(Na₂H₄)₃Zn)₂SO₄ (this formula doesn't mean anything, actually...)
(NH₄Na₆H₁₂Zn)₂SO₄
N₂H₈Na₁₂H₂₄Zn₂SO₄
No brackets left, so let's run your code with N₂H₈Na₁₂H₂₄Zn₂SO₄.
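The rewriting could be sketched like this, using plain ASCII digits instead of subscripts. The functions flatten, scale and readNumber are made-up names, and the code assumes the text inside each bracket pair is well formed.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Reads digits starting at pos and advances past them; returns 1 if there are none.
static int readNumber(const std::string& s, std::size_t& pos) {
    if (pos >= s.size() || !std::isdigit(static_cast<unsigned char>(s[pos])))
        return 1;
    int n = 0;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos])))
        n = n * 10 + (s[pos++] - '0');
    return n;
}

// Multiplies every element count in a bracket-free fragment by factor.
static std::string scale(const std::string& fragment, int factor) {
    std::string out;
    std::size_t pos = 0;
    while (pos < fragment.size()) {
        out += fragment[pos++];                          // capital letter of the symbol
        while (pos < fragment.size() &&
               std::islower(static_cast<unsigned char>(fragment[pos])))
            out += fragment[pos++];                      // rest of the symbol
        int n = readNumber(fragment, pos) * factor;
        if (n != 1)
            out += std::to_string(n);                    // a count of 1 stays implicit
    }
    return out;
}

// Repeatedly expands the deepest bracket pair until none are left.
std::string flatten(std::string formula) {
    std::size_t close;
    while ((close = formula.find(')')) != std::string::npos) {
        // The last '(' before the first ')' encloses a fragment with no brackets inside.
        std::size_t open = formula.rfind('(', close);
        if (open == std::string::npos)
            throw std::runtime_error("unmatched ')'");
        std::size_t after = close + 1;
        int factor = readNumber(formula, after);         // multiplier following ')'
        std::string inner = formula.substr(open + 1, close - open - 1);
        formula = formula.substr(0, open) + scale(inner, factor) + formula.substr(after);
    }
    if (formula.find('(') != std::string::npos)
        throw std::runtime_error("unmatched '('");
    return formula;
}
```

For example, flatten("Ca(OH)2") returns "CaO2H2". An element that occurs more than once, such as H in the example formula, still shows up as separate entries in the flattened string, so the existing code would have to merge duplicates if it needs one total per element.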

5 answers

solution1
6 2012-03-31 17:13:34

solution2
4 ACCEPTED 2012-03-31 18:36:52

solution3
3 2012-03-31 17:53:31

solution4
1 2012-03-31 17:19:37

solution5
0 2012-03-31 17:28:26
