Haskell: Why isn't my parser backtracking properly?
I decided to teach myself how to use Parsec, and I've hit a bit of a roadblock with the toy project I assigned myself.
I'm trying to parse HTML, specifically:
<html>
  <head>
    <title>Insert Clever Title</title>
  </head>
  <body>
    What don't you like?
    <select id="some stuff">
      <option name="first" font="green">boilerplate</option>
      <option selected name="second" font="blue">parsing HTML with regexes</option>
      <option name="third" font="red">closing tags for option elements
    </select>
    That was short.
  </body>
</html>
My code is:
{-# LANGUAGE FlexibleContexts, RankNTypes #-}
module Main where

import System.Environment (getArgs)
import Data.Map hiding (null)
import Text.Parsec hiding ((<|>), label, many, optional)
import Text.Parsec.Token
import Control.Applicative

data HTML = Element { tag :: String, attributes :: Map String (Maybe String), children :: [HTML] }
          | Text { contents :: String }
          deriving (Show, Eq)

type HTMLParser a = forall s u m. Stream s m Char => ParsecT s u m a

htmlDoc :: HTMLParser HTML
htmlDoc = do
    spaces
    doc <- html
    spaces >> eof
    return doc

html :: HTMLParser HTML
html = text <|> element

text :: HTMLParser HTML
text = Text <$> (many1 $ noneOf "<")

label :: HTMLParser String
label = many1 . oneOf $ ['a' .. 'z'] ++ ['A' .. 'Z']

value :: HTMLParser String
value = between (char '"') (char '"') (many anyChar) <|> label

attribute :: HTMLParser (String, Maybe String)
attribute = (,) <$> label <*> (optionMaybe $ spaces >> char '=' >> spaces >> value)

element :: HTMLParser HTML
element = do
    char '<' >> spaces
    tag <- label
    -- at least one space between each attribute and what was before
    attributes <- fromList <$> many (space >> spaces >> attribute)
    spaces >> char '>'
    -- nested html
    children <- many html
    optional $ string "</" >> spaces >> string tag >> spaces >> char '>'
    return $ Element tag attributes children

main = do
    source : _ <- getArgs
    result <- parse htmlDoc source <$> readFile source
    print result
The problem seems to be that my parser doesn't like closing tags: it seems to be greedily assuming that < always means an opening tag (as far as I can tell):
% HTMLParser temp.html
Left "temp.html" (line 3, column 32):
unexpected "/"
expecting white space
I've been playing around with it for a bit, and I'm not sure why it's not backtracking past the char '<' match.
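A note on why this happens: in Parsec, p <|> q only runs q when p fails without consuming any input, and many p fails outright if p fails after consuming input. Here element consumes the '<' of a closing tag before label fails on the '/', so many html reports an error instead of stopping. The sketch below reproduces that in miniature. The names closeMark, openMark, noBacktrack and withBacktrack are hypothetical and not part of the code above.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical stand-in for "start of a closing tag": '<' followed by '/'.
closeMark :: Parser Char
closeMark = char '<' >> char '/'

-- Hypothetical stand-in for "start of an opening tag": '<' followed by a letter.
openMark :: Parser Char
openMark = char '<' >> letter

-- closeMark consumes '<' before failing on a letter, so (<|>) never
-- gets to run openMark and the whole parser fails.
noBacktrack :: Parser Char
noBacktrack = closeMark <|> openMark

-- try rewinds the input when closeMark fails, so openMark gets its turn.
withBacktrack :: Parser Char
withBacktrack = try closeMark <|> openMark
```

On the input "<a", parse noBacktrack "" "<a" returns a Left, while parse withBacktrack "" "<a" returns Right 'a'.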
Like ehird said, I needed to use try:
attribute = (,) <$> label <*> (optionMaybe . try $ spaces >> char '=' >> spaces >> value)
--...
attributes <- fromList <$> many (try $ space >> spaces >> attribute)
--...
children <- many $ try html
optional . try $ string "</" >> spaces >> string tag >> spaces >> char '>'
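To see the attribute-related uses of try in isolation, here is a minimal, self-contained sketch of just the attribute list. It uses the simpler Parser type from Text.Parsec.String instead of the HTMLParser synonym, returns a plain list rather than a Map, and the names name, quoted, attr and attrs are hypothetical. One assumption departs from the code above: quoted values are read with many (noneOf "\"") rather than many anyChar, because anyChar would also consume the closing quote and the rest of the input.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Attribute names and unquoted values: one or more letters.
name :: Parser String
name = many1 letter

-- Assumption: a quoted value ends at the next double quote.
quoted :: Parser String
quoted = between (char '"') (char '"') (many (noneOf "\""))

-- The try lets optionMaybe yield Nothing when the whitespace after a bare
-- attribute (such as "selected") is not followed by '='.
attr :: Parser (String, Maybe String)
attr = (,) <$> name
           <*> optionMaybe (try (spaces >> char '=' >> spaces >> (quoted <|> name)))

-- The try lets many stop cleanly when whitespace is followed by '>'
-- instead of by another attribute.
attrs :: Parser [(String, Maybe String)]
attrs = many (try (space >> spaces >> attr))
```

For example, parse attrs "" " selected name=\"second\" font=blue>" yields Right [("selected",Nothing),("name",Just "second"),("font",Just "blue")], leaving the '>' unconsumed for the caller.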