简体   繁体   中英

Strange behaviour parsing an imperative language using Parsec

I'm trying to parse a fragment of the Abap language with Parsec in haskell. The statements in Abap are delimited by dots. The syntax for function definition is:

FORM <name> <arguments>.
    <statements>.
ENDFORM.

I will use it as a minimal example. Here is my attempt at writing the corresponding type in haskell and the parser. The GenStatement -Constructor is for all other statements except function definition as described above.

module Main where

import Control.Applicative
import Data.Functor.Identity

import qualified Text.Parsec as P
import qualified Text.Parsec.String as S
import Text.Parsec.Language
import qualified Text.Parsec.Token as T

type Args = String
type Name = String

data AbapExpr -- ABAP Program
   = Form Name Args [AbapExpr]
   | GenStatement String [AbapExpr]
   deriving (Show, Read)

lexer :: T.TokenParser ()
lexer = T.makeTokenParser style
  where
    caseSensitive = False
    keys = ["form", "endform"]
    style = emptyDef
        { T.reservedNames = keys
        , T.identStart = P.alphaNum <|> P.char '_'
        , T.identLetter = P.alphaNum <|> P.char '_'
        }

dot :: S.Parser String
dot = T.dot lexer

reserved :: String -> S.Parser ()
reserved = T.reserved lexer

identifier :: S.Parser String
identifier = T.identifier lexer

argsP :: S.Parser String
argsP = P.manyTill P.anyChar (P.try (P.lookAhead dot))

genericStatementP :: S.Parser String
genericStatementP = P.manyTill P.anyChar (P.try dot)

abapExprP = P.try (P.between (reserved "form")
                             (reserved "endform" >> dot)
                             abapFormP)
    <|> abapStmtP
  where
    abapFormP = Form <$> identifier <*> argsP <* dot <*> many abapExprP
    abapStmtP = GenStatement <$> genericStatementP <*> many abapExprP

Testing the parser with the following input results in a strange behaviour.

-- a wrapper for convenience
parse :: S.Parser a -> String -> Either P.ParseError a
parse = flip P.parse "Test"

testParse1 = parse abapExprP "form foo arg1 arg2 arg2. form bar arg1. endform. endform."

results in

Right (GenStatement "form foo arg1 arg2 arg2" [GenStatement "form bar arg1" [GenStatement "endform" [GenStatement "endform" []]]])

so it seems the first brach always fails and only the second generic branch is successful. However if the second branch (parsing generic statements) is commented parsing forms suddenly succeeds:

abapExprP = P.try (P.between (reserved "form")
                             (reserved "endform" >> dot)
                             abapFormP)
    --    <|> abapStmtP
  where
    abapFormP = Form <$> identifier <*> argsP <* dot <*> many abapExprP
    -- abapStmtP = GenStatement <$> genericStatementP <*> many abapExprP

Now we get

 Right (Form "foo" "arg1 arg2 arg2" [Form "bar" "arg1" []])

How is this possible? It seems that the first branch succeeds so why doesn't it work in the first example - what am I missing?

Many thanks in advance!

Looks for me that your parser genericStatementP parses any character until a dot appears (you are using P.anyChar ). Hence it doesn't recognize the reserved keywords for your lexer.

I think you must define:

type Args = [String]

and:

argsP :: S.Parser [String]
argsP = P.manyTill identifier (P.try (P.lookAhead dot))

genericStatementP :: S.Parser String
genericStatementP = identifier

With these changes I get the following result:

Right (Form "foo" ["arg1","arg2","arg2"] [Form "bar" ["arg1"] []])

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM