Correct ReadP usage in Haskell

I wrote a very simple parser for lists of numbers in a file, using ReadP in Haskell. It works, but it is very slow. Is this normal behavior for this type of parser, or am I doing something wrong?

import Text.ParserCombinators.ReadP
import qualified Data.IntSet as IntSet
import Data.Char

setsReader :: ReadP [ IntSet.IntSet ]
setsReader = 
    setReader `sepBy` ( char '\n' )

innocentWhitespace :: ReadP ()
innocentWhitespace = 
    skipMany $ (char ' ') <++ (char '\t' )

setReader :: ReadP IntSet.IntSet
setReader =  do 
    innocentWhitespace
    int_list <- integerReader `sepBy1`  innocentWhitespace
    innocentWhitespace 
    return $ IntSet.fromList int_list

integerReader :: ReadP Int
integerReader = do
    digits <- many1 $ satisfy isDigit 
    return $ read digits

readClusters:: String -> IO [ IntSet.IntSet ]
readClusters filename = do 
    whole_file <- readFile filename 
    return $ ( fst . last ) $ readP_to_S setsReader whole_file 

setReader has exponential behavior, because it allows the whitespace between the numbers to be optional. So for the line:

12 34 56

It sees these parses:

[1,2,3,4,5,6]
[12,3,4,5,6]
[1,2,34,5,6]
[12,34,5,6]
[1,2,3,4,56]
[12,3,4,56]
[1,2,34,56]
[12,34,56]
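The count is easy to check yourself. Here is a minimal sketch that reuses the question's parsers and keeps only the results that consume the whole line. `numbersReader` (the question's setReader without the IntSet conversion) and `fullParses` are names introduced here for illustration.

```haskell
import Text.ParserCombinators.ReadP
import Data.Char (isDigit)

-- Parsers as in the question: the separator may match zero characters.
innocentWhitespace :: ReadP ()
innocentWhitespace = skipMany (char ' ' <++ char '\t')

integerReader :: ReadP Int
integerReader = read <$> many1 (satisfy isDigit)

-- Hypothetical variant of setReader that keeps the list, so the
-- individual parses stay visible.
numbersReader :: ReadP [Int]
numbersReader = do
    innocentWhitespace
    intList <- integerReader `sepBy1` innocentWhitespace
    innocentWhitespace
    return intList

-- Hypothetical helper: only the parses that consumed all of the input.
fullParses :: ReadP a -> String -> [a]
fullParses p s = [ x | (x, "") <- readP_to_S p s ]
```

In GHCi, `length (fullParses numbersReader "12 34 56")` gives 8, and every additional two-digit number on the line doubles that count.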

You can see how this gets out of hand for long lines. ReadP returns all valid parses in order of increasing length, so to get to the last parse you have to traverse all of these intermediate parses. Change:

int_list <- integerReader `sepBy1` innocentWhitespace

To:

int_list <- integerReader `sepBy1` mandatoryWhitespace

For a suitable definition of mandatoryWhitespace, this squashes the exponential behavior. The parsing strategy used by parsec is more resistant to this kind of error, because it is greedy: once it consumes input in a given branch, it is committed to that branch and never goes back (unless you explicitly ask it to). So once it has correctly parsed 12, it would never go back to parse 1 2. Of course, that means the order in which you state your choices matters, which I always find a bit of a pain to think about.
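One possible definition of mandatoryWhitespace, as a sketch rather than the only option, builds it on `munch1`, which consumes the longest run of matching characters and never offers a shorter alternative. The version below also uses `munch1` for the digits, which goes a step beyond the change described above and removes the partial-number parses as well. `isBlank` and `optionalWhitespace` are names introduced here.

```haskell
import Text.ParserCombinators.ReadP
import qualified Data.IntSet as IntSet
import Data.Char (isDigit)

-- Hypothetical helper: the two characters the question treats as blanks.
isBlank :: Char -> Bool
isBlank c = c == ' ' || c == '\t'

-- One or more blanks, taken greedily.
mandatoryWhitespace :: ReadP ()
mandatoryWhitespace = munch1 isBlank >> return ()

-- Zero or more blanks, taken greedily (replaces innocentWhitespace).
optionalWhitespace :: ReadP ()
optionalWhitespace = munch isBlank >> return ()

-- The whole run of digits is consumed, never just a prefix of it.
integerReader :: ReadP Int
integerReader = read <$> munch1 isDigit

setReader :: ReadP IntSet.IntSet
setReader = do
    optionalWhitespace
    intList <- integerReader `sepBy1` mandatoryWhitespace
    optionalWhitespace
    return (IntSet.fromList intList)
```

With these definitions a line has exactly one parse that consumes all of it, and the number of intermediate results grows linearly with the count of numbers on the line.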

Also, I would use:

head [ x | (x,"") <- readP_to_S setsReader whole_file ]

To extract a valid whole-file parse, in case it very quickly consumed all the input but there were a hundred bazillion ways to interpret that input. If you don't care about the ambiguity, you would probably rather have it return the first one than the last one, because the first one will arrive faster.
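Note that `head` throws an exception when no parse consumes the whole input. A slightly safer variant of the same idea returns a `Maybe` instead. This is only a sketch, and `parseAll` is a name introduced here, not part of ReadP.

```haskell
import Text.ParserCombinators.ReadP
import Data.Maybe (listToMaybe)

-- First parse that consumes the entire input, or Nothing if there is none.
parseAll :: ReadP a -> String -> Maybe a
parseAll p s = listToMaybe [ x | (x, "") <- readP_to_S p s ]
```

In readClusters this would be called as `parseAll setsReader whole_file`. Because the result list is lazy, no further parses are computed once the first complete one has been found.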
