
Parsing chemical compounds in Haskell

I was trying to make a chemical compound parser as an exercise for myself but I got stuck.

Here is the data type I am trying to use:

data Compound = Monoatomic String Int | Poliatomic [Compound] Int

Given a string like "Ca(OH)2", I want to get something like:

Poliatomic [Monoatomic "Ca" 1, Poliatomic [Monoatomic "O" 1, Monoatomic "H" 1] 2 ] 1

The Monoatomic constructor is for single atoms, and the Poliatomic constructor is for groups of atoms. In this example, (OH)2 is an inner Poliatomic structure, represented as Poliatomic [Monoatomic "O" 1, Monoatomic "H" 1] 2. The number 2 means that there are two of those polyatomic groups.
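
As a quick sanity check of this encoding, here is a minimal sketch that builds the expected value by hand and flattens it into total atom counts. The names calciumHydroxide and atomCounts are hypothetical helpers introduced only for illustration; they are not part of the original code.

```haskell
import qualified Data.Map.Strict as Map

data Compound = Monoatomic String Int | Poliatomic [Compound] Int
  deriving Show

-- The value the question expects for "Ca(OH)2", written out by hand.
calciumHydroxide :: Compound
calciumHydroxide =
  Poliatomic
    [ Monoatomic "Ca" 1
    , Poliatomic [Monoatomic "O" 1, Monoatomic "H" 1] 2
    ]
    1

-- Hypothetical helper: total count of each atom, multiplying the counts
-- of inner groups by the count of the enclosing group.
atomCounts :: Compound -> Map.Map String Int
atomCounts (Monoatomic name n) = Map.singleton name n
atomCounts (Poliatomic parts n) =
  Map.map (* n) (Map.unionsWith (+) (map atomCounts parts))
```

Under these assumptions, Map.toList (atomCounts calciumHydroxide) gives one calcium, two hydrogen and two oxygen atoms, which matches the formula.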

This is what I have so far:

import Data.Char (isUpper)
data Compound = Monoatomic String Int | Poliatomic [Compound] Int

instance Functor Compound where
        fmap f (Monoatomic s i) = Monoatomic (f s) i
        fmap f (Poliatomic xs i) = Poliatomic (fmap f xs) i

-- Change number of a compound
changeNumber :: Compound -> Int -> Compound
changeNumber (Monoatomic xs _) n = Monoatomic xs n
changeNumber (Poliatomic xs _) n = Poliatomic xs n

-- Take a partial compound and the next character; return the updated partial compound
parseCompound :: Compound -> Char -> Compound
parseCompound (Poliatomic x:xs n) c
        | isUpper c = Poliatomic ((Monoatomic [c] 1):x:xs) n -- add new atom to compound
        | isLower c = Poliatomic 

-- I want to do foldl parseCompound (Poliatomic [] 1) inputstring

but then it got too complicated for me to continue.

It looks like it should be a fairly simple problem, but I am very new to Haskell and can't figure out how to complete this function.

I have these questions:

  • Is my approach correct so far?
  • How can I make this work?

I have created the parser you are looking for with Parsec to give you a sense of what Parsec parsers look like, since you stated you had little experience with it.

Even with little Haskell experience, it should be fairly readable. I have provided some comments on the parts where there's something in particular to look out for.

import Text.Read (readMaybe)
import Data.Maybe (fromMaybe)
import Text.Parsec (parse, many, many1, digit, char, string, (<|>), choice, try)
import Text.Parsec.String (Parser)


data Compound
  = Monoatomic String Int
  | Poliatomic [Compound] Int
  deriving Show


-- Run the substance parser on "Ca(OH)2" and print the result which is
-- Right (Poliatomic [Monoatomic "Ca" 1,Poliatomic [Monoatomic "O" 1,Monoatomic "H" 1] 2] 1)
main = print (parse substance "" "Ca(OH)2")


-- parse the many parts which make up the top-level polyatomic compound
--
-- "many1" means "at least one"
substance :: Parser Compound
substance = do
  topLevel <- many1 part
  return (Poliatomic topLevel 1)


-- a single part in a substance is either a poliatomic compound or a monoatomic compound
part :: Parser Compound
part = poliatomic <|> monoatomic


-- a poliatomic compound starts with a '(', then has many parts inside, then
-- ends with ')' and has a number after it which indicates how many of it there
-- are.
poliatomic :: Parser Compound
poliatomic = do
  char '('
  inner <- many1 part
  char ')'
  amount <- many1 digit
  return (Poliatomic inner (read amount))


-- a monoatomic compound is one of the many element names, followed by an
-- optional number. if omitted, the amount defaults to 1.
--
-- "try" is a little special, and required in this case. it means "if a parser
-- fails, try the next one from where you started, not from where the last one
-- failed."
--
-- "choice" means "try all parsers in this list, stop when one matches"
--
-- "many" means "zero or more"
monoatomic :: Parser Compound
monoatomic = do
  name <- choice [try nameParser | nameParser <- atomstrings]
  amount <- many digit
  return (Monoatomic name (fromMaybe 1 (readMaybe amount)))


-- a list of parsers for atom names. it is IMPORTANT that the longest names
-- come first: "choice" commits to the first parser that succeeds, so if "H"
-- were listed before "He", the input "He" would be read as "H" with a stray
-- "e" left over. ordering the list this way keeps the parser simple and
-- fast, and it's common to consider things like that when designing parsers.
atomstrings :: [Parser String]
atomstrings = map string (words "He Li Be Ne Na Mg Al Ca H B C N O F")
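
To see why "try" is needed around each name parser, here is a small sketch that compares the two behaviours on the input "H2". The names withoutTry, withTry and run are hypothetical and exist only for this illustration.

```haskell
import Text.Parsec (parse, string, try, (<|>))
import Text.Parsec.String (Parser)

-- Without "try": string "He" consumes the 'H' before failing on the next
-- character, so the alternative string "H" is never attempted.
withoutTry :: Parser String
withoutTry = string "He" <|> string "H"

-- With "try": the failed attempt is rewound to where it started, so
-- string "H" gets its turn.
withTry :: Parser String
withTry = try (string "He") <|> string "H"

-- Hypothetical helper that throws away the error message.
run :: Parser a -> String -> Maybe a
run p input = either (const Nothing) Just (parse p "" input)
```

With these definitions, run withoutTry "H2" is Nothing, while run withTry "H2" is Just "H".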

I've tried to write this code in a way that should be at least reasonably accessible to a beginner, but it's probably not crystal clear so I'm happy to answer any questions about this.
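
If you would rather not rely on keeping the list ordered by hand, one option is to sort the names by descending length before building the parsers. This is only a sketch; elementNames, longestFirst, atomName and atomNameUnsorted are hypothetical names, and the element list is the same small set as above in a deliberately unhelpful order.

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))
import Text.Parsec (choice, parse, string, try)
import Text.Parsec.String (Parser)

-- The same element names as above, but with short names listed first.
elementNames :: [String]
elementNames = words "H He B Be C Ca N Ne Na O F Li Mg Al"

-- Put the longest names first, so "He" is attempted before "H".
longestFirst :: [String] -> [String]
longestFirst = sortOn (Down . length)

-- Name parser built from the sorted list.
atomName :: Parser String
atomName = choice (map (try . string) (longestFirst elementNames))

-- For contrast: the same names unsorted, where "H" wins over "He".
atomNameUnsorted :: Parser String
atomNameUnsorted = choice (map (try . string) elementNames)
```

Here parse atomName "" "He" yields Right "He", whereas parse atomNameUnsorted "" "He" yields Right "H" and leaves the "e" unread.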


The parser above is the one you asked for. However, it's not the one I would write if I had free rein. If I could do it however I wanted, I would exploit the fact that

Ca(OH)2

can be represented as

(Ca)1((O)1(H)1)2

which is a much more uniform representation, and in turn results in a simpler data structure and a parser with less boilerplate. The code I'd prefer to write would look like

import Text.Read (readMaybe)
import Data.Maybe (fromMaybe)
import Control.Applicative ((<$>), (<*>), pure)
import Text.Parsec (parse, many, many1, digit, char, string, (<|>), choice, try, between)
import Text.Parsec.String (Parser)


data Substance
  = Part [Substance] Int
  | Atom String
  deriving Show


main = print (parse substance "" "Ca(OH)2")
-- Right (Part [Part [Atom "Ca"] 1,Part [Part [Atom "O"] 1,Part [Atom "H"] 1] 2] 1)

substance :: Parser Substance
substance = Part <$> many1 part <*> pure 1

part :: Parser Substance
part = do
  inner <- polyatomic <|> monoatomic
  amount <- fromMaybe 1 . readMaybe <$> many digit
  return (Part inner amount)

polyatomic :: Parser [Substance]
polyatomic = between (char '(') (char ')') (many1 part)

monoatomic :: Parser [Substance]
monoatomic = (:[]) . Atom <$> choice (map (try . string) atomstrings)

atomstrings :: [String]
atomstrings = words "He Li Be Ne Na Mg Al Ca H B C N O F"

This uses a few "advanced" tricks in Haskell (such as the <$> and <*> operators), so it might not be of interest to you, OP, but I'm putting it in for other people who are more advanced Haskell users and are learning about Parsec.
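
For readers who have not met <$> and <*> yet, here is a sketch that puts the applicative definitions next to equivalent do-notation versions. substanceDo, monoatomicDo and run are hypothetical names added for comparison; the rest is repeated from the parser above so the example compiles on its own.

```haskell
import Data.Maybe (fromMaybe)
import Text.Parsec (between, char, choice, digit, many, many1, parse, string, try, (<|>))
import Text.Parsec.String (Parser)
import Text.Read (readMaybe)

data Substance
  = Part [Substance] Int
  | Atom String
  deriving (Show, Eq)

-- Applicative version: run "many1 part", then supply the constant 1 as
-- the second argument of Part.
substance :: Parser Substance
substance = Part <$> many1 part <*> pure 1

-- The same parser spelled out in do-notation.
substanceDo :: Parser Substance
substanceDo = do
  parts <- many1 part
  return (Part parts 1)

-- "f <$> p" runs p and applies f to its result, so this ...
monoatomic :: Parser [Substance]
monoatomic = (:[]) . Atom <$> choice (map (try . string) atomstrings)

-- ... behaves the same as this.
monoatomicDo :: Parser [Substance]
monoatomicDo = do
  name <- choice (map (try . string) atomstrings)
  return [Atom name]

part :: Parser Substance
part = do
  inner <- polyatomic <|> monoatomic
  amount <- fromMaybe 1 . readMaybe <$> many digit
  return (Part inner amount)

polyatomic :: Parser [Substance]
polyatomic = between (char '(') (char ')') (many1 part)

atomstrings :: [String]
atomstrings = words "He Li Be Ne Na Mg Al Ca H B C N O F"

-- Hypothetical helper that throws away the error message.
run :: Parser a -> String -> Maybe a
run p input = either (const Nothing) Just (parse p "" input)
```

Both spellings should accept and reject the same inputs; which one to use is mostly a matter of taste and of how comfortable the reader is with the operators.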

This parser takes only about half a page, as you can see, and that's the power of libraries like Parsec: they make it both easy and fun to write parsers!
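
To connect the Substance type back to the uniform notation (Ca)1((O)1(H)1)2, here is a minimal sketch of a pretty-printer. render, renderTopLevel and example are hypothetical names; example is the value the parser returns for "Ca(OH)2".

```haskell
data Substance
  = Part [Substance] Int
  | Atom String
  deriving Show

-- The parse result for "Ca(OH)2", written out by hand.
example :: Substance
example =
  Part [Part [Atom "Ca"] 1, Part [Part [Atom "O"] 1, Part [Atom "H"] 1] 2] 1

-- Every Part becomes "(...)" followed by its count; an Atom is just its name.
render :: Substance -> String
render (Atom name) = name
render (Part inner n) = "(" ++ concatMap render inner ++ ")" ++ show n

-- The substance parser wraps everything in one extra Part with count 1,
-- so skip that outer wrapper and render only its children.
renderTopLevel :: Substance -> String
renderTopLevel (Part inner _) = concatMap render inner
renderTopLevel atom = render atom
```

With these definitions, renderTopLevel example produces the string "(Ca)1((O)1(H)1)2".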
