
String parsing in Haskell

I am very new to Haskell and am currently trying to solve a problem that requires some string parsing. My input String contains a comma-delimited list of words in quotes. I want to parse this single string into a list of the words as Strings. Where should I start learning about parsing such a String? Is there a particular module and/or set of functions that will be helpful?

P.S. Please don't post a full solution. I am just asking for a pointer to a starting place so I can learn how to do it.

The most powerful solution is a parser combinator library. Haskell has several of these, but the foremost that come to my mind are:

  • parsec: a very good general-purpose parsing library
  • attoparsec: a faster, parsec-like library, which sacrifices the quality of error messages and some other features for extra speed
  • uu-parsinglib: a very powerful parsing library

The big advantage of parser combinators is that it is very easy to define parsers using do notation (or Applicative style, if you prefer).
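To make the two styles concrete, here is a minimal parsec sketch for input shaped like the one in the question. The names quotedWord, quotedWords, quotedWords' and parseQuotedWords are made up for this illustration, and the sketch assumes that a quoted word never contains an escaped quote.

```haskell
import Text.Parsec (ParseError, between, char, eof, many, noneOf, parse, sepBy)
import Text.Parsec.String (Parser)

-- One word wrapped in double quotes, e.g. "MARY".
-- Assumption: no escape sequences inside the quotes.
quotedWord :: Parser String
quotedWord = between (char '"') (char '"') (many (noneOf "\""))

-- do-notation style: zero or more quoted words separated by commas,
-- followed by the end of the input.
quotedWords :: Parser [String]
quotedWords = do
  ws <- quotedWord `sepBy` char ','
  eof
  return ws

-- The same parser written in Applicative style.
quotedWords' :: Parser [String]
quotedWords' = (quotedWord `sepBy` char ',') <* eof

-- Run the parser; "<input>" is only a label used in error messages.
parseQuotedWords :: String -> Either ParseError [String]
parseQuotedWords = parse quotedWords "<input>"
```

Because quotedWord consumes everything up to the closing quote, a comma inside the quotes stays part of the word instead of acting as a separator.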

If you just want some quick and simple string manipulation capabilities, then consult the text library (for high-performance packed Unicode strings) or Data.List (for ordinary list-encoded strings); both provide the functions needed to manipulate strings.

I finally decided to roll my own parsing functions since this is such a simple situation. I have learned a lot about Haskell since I first posted this question and want to document my solution here:

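import Data.List (sort)
import System.IO (IOMode (ReadMode), hClose, hGetContents, openFile)

-- Split a string on a delimiter character, dropping the delimiters.
split :: Char -> String -> [String]
split _ "" = []
split c s = firstWord : split c rest
    where firstWord = takeWhile (/=c) s
          rest = drop (length firstWord + 1) s

-- Remove every occurrence of a character from a string.
removeChar :: Char -> String -> String
removeChar _ [] = []
removeChar ch (c:cs)
    | c == ch   = removeChar ch cs
    | otherwise = c : removeChar ch cs

-- Read the names file, strip the quotes, and print the sorted names.
main :: IO ()
main = do
    handle <- openFile "input/names.txt" ReadMode
    contents <- hGetContents handle
    let names = sort (map (removeChar '"') (split ',' contents))
    print names
    hClose handle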

Since Strings are simply lists of Chars in Haskell, Data.List would be a good place to start looking (in the interest of learning Haskell).
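As a rough idea of what that can look like, here is a small sketch built from unfoldr (Data.List) plus break and filter (Prelude). The names splitBy and stripQuotes are hypothetical, and the sketch assumes the delimiter never appears inside a quoted word.

```haskell
import Data.List (unfoldr)

-- Split a string on a separator character, dropping the separators.
-- unfoldr keeps calling step until it returns Nothing.
splitBy :: Char -> String -> [String]
splitBy sep = unfoldr step
  where
    step [] = Nothing
    step s  =
      let (word, rest) = break (== sep) s
      in Just (word, drop 1 rest)  -- drop 1 skips the separator, if any

-- Remove every double quote from a string.
stripQuotes :: String -> String
stripQuotes = filter (/= '"')
```

Combining the two as map stripQuotes (splitBy ',' input) gives the list of bare words for simple input.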

For more complex cases (where commas may be nested inside quotes and should be ignored, for example), parsec (as Daniel mentioned) would be a better solution.

Also, if you're looking to parse CSVs you may try Text.CSV, though I've not tried it, so I can't say how helpful it'll be.

Here's a particularly cheeky way to proceed:

parseCommaSepQuotedWords :: String -> [String]
parseCommaSepQuotedWords s = read ("[" ++ s ++ "]")

This might work, but it's very fragile and rather silly. Essentially you are using the fact that the Haskell way of writing lists of strings almost coincides with your way, and hence the built-in Read instance is almost the thing you want. You could use reads for better error reporting, but in reality you probably want to do something else entirely.
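For what the reads variant might look like, here is a minimal sketch. The name parseQuotedWordsSafe is made up for this example; it returns Nothing instead of throwing an exception when the input is not a valid Haskell list of string literals.

```haskell
import Data.Char (isSpace)

-- Wrap the input in brackets and let the Read instance for [String] parse it.
-- Succeeds only if there is exactly one parse and nothing but whitespace
-- is left over afterwards.
parseQuotedWordsSafe :: String -> Maybe [String]
parseQuotedWordsSafe s =
  case reads ("[" ++ s ++ "]") of
    [(ws, rest)] | all isSpace rest -> Just ws
    _                               -> Nothing
```

This is still fragile in the same way as the read version: the words must follow Haskell's string literal syntax, including its escape rules.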

In general, parsec is really worth taking a look at: it's a joy to use, and one of the things that originally really got me excited about Haskell. But if you want a homegrown solution, I often write simple things using case statements on the result of span and break. Suppose you are looking for the next semicolon in the input. Then break (== ';') inp will return (before, after), where:

  • before is the content of inp up to (and not including) the first semicolon (or all of it if there is none)
  • after is the rest of the string:
    • if after is not empty, the first element is a semicolon
    • regardless of what else happens, before ++ after == inp

So to parse a list of statements separated by semicolons, I might do this:

parseStmts :: String -> Maybe [Stmt]
parseStmts inp = case break (== ';') inp of
  (before, _ : after) -> -- ...
    -- ^ before is the first statement
    --     ^ ignore the semicolon
    --           ^ after is the rest of the string
  (_, []) -> -- inp doesn't contain any semicolons
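One possible way to fill in that skeleton is sketched below. The Stmt type and the parseStmt helper are placeholders invented for this example (a statement is just its raw text, and empty statements are rejected); a real parser would substitute its own.

```haskell
-- Hypothetical statement type: just the raw text of one statement.
newtype Stmt = Stmt String
  deriving (Show, Eq)

-- Hypothetical single-statement parser: rejects empty statements.
parseStmt :: String -> Maybe Stmt
parseStmt "" = Nothing
parseStmt s  = Just (Stmt s)

-- Parse statements separated by semicolons.
parseStmts :: String -> Maybe [Stmt]
parseStmts inp = case break (== ';') inp of
  -- A semicolon was found: parse the text before it, then the remainder.
  (before, _ : after) -> (:) <$> parseStmt before <*> parseStmts after
  -- No semicolon left: the whole input is the final statement.
  (before, [])        -> (: []) <$> parseStmt before
```

With these assumptions, parseStmts "a;b;c" gives Just [Stmt "a",Stmt "b",Stmt "c"], while "a;;b" gives Nothing because the empty statement between the two semicolons is rejected.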

For the sake of a complete answer for anyone who happens upon this question, Data.Text also has some nice functions.
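A short sketch of how Data.Text could be applied to the input from the question, assuming the text package is available. The name parseNames is made up for this example, and it assumes no commas appear inside the quoted words.

```haskell
import qualified Data.Text as T

-- Split on commas, trim surrounding whitespace (such as a trailing newline),
-- then drop the double quotes at both ends of each piece.
parseNames :: T.Text -> [T.Text]
parseNames = map (T.dropAround (== '"') . T.strip) . T.splitOn (T.pack ",")
```

Note that T.splitOn returns a single empty piece for empty input, so parseNames on an empty Text yields a one-element list rather than an empty one.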

Use parsec for anything that is 'real work'.

For an introduction, read https://therning.org/magnus/archives/tag/parsec


 