
Parsing multiple lines into a list of lists in Haskell

I am trying to parse a file that looks like:

a b c 
f e d

I want to match each of the symbols in the line and parse everything into a list of lists such as:

[[A, B, C], [D, E, F]]

In order to do that I tried the following:

import           Control.Monad
import           Text.ParserCombinators.Parsec
import           Text.ParserCombinators.Parsec.Language
import qualified Text.ParserCombinators.Parsec.Token    as P

parserP :: Parser [[MyType]]
parserP = do
  x  <- rowP
  xs <- many (newline >> rowP)
  return (x : xs)

rowP :: Parser [MyType]
rowP = manyTill cellP $ void newline <|> eof

cellP :: Parser MyType
cellP = aP <|> bP <|> ... -- rest of the parsers, they all look very similar

aP :: Parser MyType
aP = symbol "a" >> return A

bP :: Parser MyType
bP = symbol "b" >> return B

lexer = P.makeTokenParser emptyDef
symbol  = P.symbol lexer

But it fails to return multiple inner lists. Instead what I get is:

[[A, B, C, D, E, F]]

What am I doing wrong? I was expecting manyTill to parse cellP until the newline character, but that's not the case.

Parser combinators are overkill for something this simple. I'd use lines :: String -> [String] and words :: String -> [String] to break up the input and then map the individual tokens to MyType values.

toMyType :: String -> Maybe MyType
toMyType "a" = Just A
toMyType "b" = Just B
toMyType "c" = Just C
toMyType _ = Nothing

parseMyType :: String -> Maybe [[MyType]]
parseMyType = traverse (traverse toMyType) . fmap words . lines
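For completeness, here is a self-contained sketch of the same idea covering all six symbols. The definition of MyType and the lookup table are assumptions, since the question does not show the real type.

```haskell
-- MyType stands in for the asker's type; this definition is an assumption.
data MyType = A | B | C | D | E | F
  deriving (Show, Eq)

-- Map one whitespace-delimited token to a constructor.
toMyType :: String -> Maybe MyType
toMyType tok = lookup tok table
  where
    table = [("a", A), ("b", B), ("c", C), ("d", D), ("e", E), ("f", F)]

-- Split the input into lines, split each line into words, convert every word.
-- A single unrecognised token makes the whole result Nothing.
parseMyType :: String -> Maybe [[MyType]]
parseMyType = traverse (traverse toMyType . words) . lines
```

With the sample input, parseMyType "a b c \nf e d\n" evaluates to Just [[A,B,C],[F,E,D]]. The trailing space on the first line is harmless because words discards it.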

You're right that manyTill keeps parsing until a newline. But manyTill never gets to see the newline because cellP is too eager. cellP ends up calling P.symbol, whose documentation states

symbol :: String -> ParsecT s u m String

Lexeme parser symbol s parses string s and skips trailing white space.

The keyword there is 'white space'. It turns out Parsec defines whitespace as any character that satisfies isSpace, which includes newlines. So P.symbol happily consumes the c, followed by the space and the newline, and then manyTill looks and doesn't see a newline because it has already been consumed.
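One way to check this is to run symbol on its own and look at the input that remains. The sketch below assumes the same emptyDef lexer as in the question; leftoverAfterC is a made-up helper name.

```haskell
import           Text.ParserCombinators.Parsec
import           Text.ParserCombinators.Parsec.Language (emptyDef)
import qualified Text.ParserCombinators.Parsec.Token    as P

-- Same lexer as in the question, with the user state fixed to ().
lexer :: P.TokenParser ()
lexer = P.makeTokenParser emptyDef

-- Parse the symbol "c", then return whatever input is left unconsumed.
leftoverAfterC :: String -> Either ParseError String
leftoverAfterC = parse (P.symbol lexer "c" *> getInput) ""
```

leftoverAfterC "c \nf e d" gives Right "f e d": both the space and the newline are gone before manyTill gets a chance to test for the newline.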

If you're willing to drop Parsec, go with Benjamin's solution. But if you're determined to stick with Parsec, the basic idea is to modify the token parser's whiteSpace field so that whitespace no longer includes newlines. Something like

-- Token is imported qualified as P, so the record field is written P.whiteSpace.
lexer = let lexer0 = P.makeTokenParser emptyDef
        in lexer0 { P.whiteSpace = void $ many (oneOf " \t") }

That's pseudocode and probably won't work for your specific case, but the idea is there. You want whiteSpace to mean whatever you choose to treat as whitespace, not what the library defines by default. Note that changing this will also break your comment syntax, if you have one defined, since the default whiteSpace was equipped to handle comments.
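As far as I can tell, makeTokenParser builds symbol from its own internal whiteSpace, so overriding the record field afterwards does not change what P.symbol skips. A minimal sketch that sidesteps the token module entirely is to write a small symbol by hand that skips only spaces and tabs. The MyType definition and the helper name blanks are assumptions, and the row structure is simplified compared with the question's manyTill version.

```haskell
import           Text.ParserCombinators.Parsec

-- MyType stands in for the asker's type; this definition is an assumption.
data MyType = A | B | C | D | E | F
  deriving (Show, Eq)

-- Skip spaces and tabs only, so newlines remain available as row separators.
blanks :: Parser ()
blanks = skipMany (oneOf " \t")

-- Hand-written replacement for P.symbol: match the text, then skip blanks.
symbol :: String -> Parser String
symbol s = try (string s) <* blanks

cellP :: Parser MyType
cellP = choice
  [ A <$ symbol "a", B <$ symbol "b", C <$ symbol "c"
  , D <$ symbol "d", E <$ symbol "e", F <$ symbol "f" ]

-- One row is one or more cells; it stops at the first newline or end of input.
rowP :: Parser [MyType]
rowP = many1 cellP

-- Rows separated (and optionally terminated) by newlines, then end of input.
parserP :: Parser [[MyType]]
parserP = rowP `sepEndBy` newline <* eof
```

parse parserP "" "a b c \nf e d\n" yields Right [[A,B,C],[F,E,D]]. This sketch does not accept blank lines or leading indentation; both would need extra handling.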

In short, Benjamin's answer is probably the best way to go. There's no real reason to use Parsec here. But it's also helpful to know why this particular solution didn't work: Parsec's default definition of a language wasn't designed to treat newlines with significance.
