How do you use parsec in a greedy fashion?

In my work I come across a lot of gnarly SQL, and I had the bright idea of writing a program to parse the SQL and print it out neatly. I made most of it pretty quickly, but I ran into a problem that I don't know how to solve.

So let's pretend the SQL is "select foo from bar where 1". My thought was that there is always a keyword followed by data for it, so all I have to do is parse a keyword, then capture all the gibberish before the next keyword and store it for later cleanup, if that is worthwhile. Here's the code:

import Text.Parsec
import Text.Parsec.Combinator
import Text.Parsec.Char
import Data.Text (strip)

newtype Statement = Statement [Atom]
data Atom = Branch String [Atom] | Leaf String deriving Show

trim str = reverse $ trim' (reverse $ trim' str)
  where
    trim' (' ':xs) = trim' xs
    trim' str = str

printStatement atoms = mapM_ printAtom atoms
printAtom atom = loop 0 atom 
  where
    loop depth (Leaf str) = putStrLn $ (replicate depth ' ') ++ str
    loop depth (Branch str atoms) = do 
      putStrLn $ (replicate depth ' ') ++ str
      mapM_ (loop (depth + 2)) atoms

keywords :: [String]
keywords = [
  "select",
  "update",
  "delete",
  "from",
  "where"]

keywordparser :: Parsec String u String
keywordparser = try ((choice $ map string keywords) <?> "keywordparser")

stuffparser :: Parsec String u String
stuffparser = manyTill anyChar (eof <|> (lookAhead keywordparser >> return ()))

statementparser = do
  key <- keywordparser
  stuff <- stuffparser
  return $ Branch key [Leaf (trim stuff)]
  <?> "statementparser"

tp = parse (many statementparser) ""

The key here is the stuffparser. It handles the stuff in between the keywords, which could be anything from column lists to where criteria. This function catches all characters leading up to a keyword. But it needs something else before it is finished. What if there is a subselect, as in "select id,(select product from products) from bar"? In that case, when it hits the inner keyword, it parses everything wrong and messes up my indenting. Where clauses can contain parentheses as well.

So I need to change that anyChar into another combinator that slurps up characters one at a time but also looks for parentheses. If it finds them, it should traverse and capture everything inside, recursing into any nested parentheses until they are fully closed, then concatenate it all and return it. Here's what I've tried, but I can't quite get it to work.

stuffparser :: Parsec String u String
stuffparser = fmap concat $ manyTill somechars (eof <|> (lookAhead keywordparser >> return ()))
  where
    somechars = parens <|> fmap (\c -> [c]) anyChar
    parens= between (char '(') (char ')') somechars

This fails like so:

> tp "select asdf(qwerty) from foo where 1"
Left (line 1, column 14):
unexpected "w"
expecting ")"

But I can't think of any way to rewrite this so that it works. I've tried to use manyTill on the parenthesis part, but I end up having trouble getting it to typecheck when I have both string-producing parens and single chars as alternatives. Does anyone have any suggestions on how to go about this?

Yeah, between might not work for what you're looking for. Of course, for your use case, I'd follow hammar's suggestion and grab an off-the-shelf SQL parser. (Personal opinion: or, try not to use SQL unless you really have to; the idea of using strings for database queries was, IMHO, a historical mistake.)
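
For context, the error in the question arises because between (char '(') (char ')') somechars accepts exactly one somechars item between the parentheses, so after reading q it expects ) and finds w. The following is a minimal sketch of one possible repair that stays close to the question's code: read many items inside the parentheses and stop at the closing one. The names follow the question; the type signatures and the single helper are assumptions added for illustration.

```haskell
import Control.Monad (void)
import Text.Parsec

keywords :: [String]
keywords = ["select", "update", "delete", "from", "where"]

keywordparser :: Parsec String () String
keywordparser = try (choice (map string keywords))

-- Like the question's stuffparser, but a parenthesised group may hold
-- any number of items, including nested groups.
stuffparser :: Parsec String () String
stuffparser =
  concat <$> manyTill somechars (eof <|> void (lookAhead keywordparser))
  where
    somechars = parens <|> single anyChar
    -- hypothetical helper: wrap a one-character result in a String
    single p = (: []) <$> p
    parens = do
      _ <- char '('
      inner <- many (parens <|> single (noneOf "()"))
      _ <- char ')'
      return ("(" ++ concat inner ++ ")")
```

Because keywords inside a parenthesised group are consumed by parens, only top-level keywords end the run of stuff.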

Note: I add an operator called <++> which concatenates the results of two parsers, whether they are strings or characters. (Code at bottom.)

First, for the task of parsing parentheses: the top level parses some stuff between the opening and closing characters, which is exactly what the code says:

parseParen = char '(' <++> inner <++> char ')'

Then the inner function should parse everything else: non-paren characters, optionally followed by another set of parentheses and whatever non-paren junk comes after it.

parseParen = char '(' <++> inner <++> char ')' where
    inner = many (noneOf "()") <++> option "" (parseParen <++> inner)
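
If the operator-heavy form obscures the recursion, the same idea can be spelled out with do-notation. This is only a sketch: parenGroup and innerText are hypothetical names standing in for parseParen and inner, and the concrete parser type is an assumption.

```haskell
import Text.Parsec

-- Hypothetical do-notation counterpart of parseParen, without <++>.
parenGroup :: Parsec String () String
parenGroup = do
  _ <- char '('
  body <- innerText
  _ <- char ')'
  return ("(" ++ body ++ ")")
  where
    -- plain characters, then optionally a nested group and more text
    innerText = do
      plain <- many (noneOf "()")
      rest <- option "" ((++) <$> parenGroup <*> innerText)
      return (plain ++ rest)
```

The parser returns the whole balanced group as text, parentheses included, and fails if a group is left unclosed.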

I'll make the assumption that for the rest of the solution, what you want to do is analogous to splitting things up by top-level SQL keywords (i.e. ignoring those in parentheses). Namely, we'll have a parser that behaves like so:

Main> parseTest parseSqlToplevel "select asdf(select m( 2) fr(o)m w where n) from b where delete 4"
[(Select," asdf(select m( 2) fr(o)m w where n) "),(From," b "),(Where," "),(Delete," 4")]

Suppose we have a parseKw parser that recognizes select and the other keywords. After we consume a keyword, we need to read until the next [top-level] keyword. The last trick in my solution is using the lookAhead combinator to determine whether the next word is a keyword, and put it back if so. If it's not, we consume a parenthesized group or some other word, and then recurse on the rest.

-- consume spaces, then eat a word or parenthesis
parseOther = many space <++>
    (("" <$ lookAhead (try parseKw)) <|> -- if there's a keyword, put it back!
     option "" ((parseParen <|> many1 (noneOf "() \t")) <++> parseOther))
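
To illustrate the "put it back" part in isolation, here is a small sketch of lookAhead combined with try. The parser peekFrom is a hypothetical example, not part of the solution; it only checks for the word from and then reads all remaining input to show that nothing was consumed by the peek.

```haskell
import Text.Parsec

-- Report whether the input starts with "from", then return all input.
-- try makes a partial match (such as "fro" in "frog") fail without
-- consuming, and lookAhead rewinds the input after a successful match.
peekFrom :: Parsec String () (Bool, String)
peekFrom = do
  isKw <- option False (True <$ lookAhead (try (string "from")))
  rest <- many anyChar
  return (isKw, rest)
```

In both the matching and the non-matching case, rest still begins at the first character of the input.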

My entire solution is as follows:

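{-# LANGUAGE FlexibleInstances #-}  -- needed for the String instance below
import Control.Applicative hiding (many, (<|>))
import Text.Parsec

-- overloaded operator to concatenate string results from parsers
class CharOrStr a where toStr :: a -> String
instance CharOrStr Char where toStr x = [x]
instance CharOrStr String where toStr = id
infixl 4 <++>
f <++> g = (\x y -> toStr x ++ toStr y) <$> f <*> g

data Keyword = Select | Update | Delete | From | Where deriving (Eq, Show)

-- concrete parser type, so the top-level definitions get fixed types
type Parser = Parsec String ()

parseKw :: Parser Keyword
parseKw =
    (Select <$ string "select") <|>
    (Update <$ string "update") <|>
    (Delete <$ string "delete") <|>
    (From <$ string "from") <|>
    (Where <$ string "where") <?>
    "keyword (select, update, delete, from, where)"

-- consume spaces, then eat a word or parenthesis
parseOther :: Parser String
parseOther = many space <++>
    (("" <$ lookAhead (try parseKw)) <|> -- if there's a keyword, put it back!
     option "" ((parseParen <|> many1 (noneOf "() \t")) <++> parseOther))

-- a list of (keyword, following text) pairs covering the whole input
parseSqlToplevel :: Parser [(Keyword, String)]
parseSqlToplevel = many ((,) <$> parseKw <*> (space <++> parseOther)) <* eof

-- a balanced parenthesised group, returned as text
parseParen :: Parser String
parseParen = char '(' <++> inner <++> char ')' where
    inner = many (noneOf "()") <++> option "" (parseParen <++> inner)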
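
One caveat: string "from" also matches the start of a longer word such as fromage, so parseKw treats that word as a keyword. If that matters for your input, a possible refinement is to require that no alphanumeric character follows the keyword. This is a sketch under that assumption; kw and parseKwWord are hypothetical names, and identifiers containing characters such as underscores would need a wider check.

```haskell
import Text.Parsec

data Keyword = Select | Update | Delete | From | Where deriving (Eq, Show)

-- Hypothetical helper: match the keyword text only as a whole word.
-- try lets the next alternative run if the word turns out to be longer.
kw :: Keyword -> String -> Parsec String () Keyword
kw k s = try (k <$ string s <* notFollowedBy alphaNum)

parseKwWord :: Parsec String () Keyword
parseKwWord = choice
    [ kw Select "select", kw Update "update", kw Delete "delete"
    , kw From "from", kw Where "where"
    ] <?> "keyword (select, update, delete, from, where)"
```

With this variant, "from b" yields From while "fromage" is rejected as a keyword and can be read as ordinary text.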

Edit: version with quote support

You can do the same thing as with the parens to support quotes:

{-# LANGUAGE FlexibleInstances #-}  -- needed for the String instance below
import Control.Applicative hiding (many, (<|>))
import Text.Parsec
import Text.Parsec.Combinator

-- overloaded operator to concatenate string results from parsers
class CharOrStr a where toStr :: a -> String
instance CharOrStr Char where toStr x = [x]
instance CharOrStr String where toStr = id
infixl 4 <++>
f <++> g = (\x y -> toStr x ++ toStr y) <$> f <*> g

data Keyword = Select | Update | Delete | From | Where deriving (Eq, Show)

-- concrete parser type, so the top-level definitions get fixed types
type Parser = Parsec String ()

parseKw :: Parser Keyword
parseKw =
    (Select <$ string "select") <|>
    (Update <$ string "update") <|>
    (Delete <$ string "delete") <|>
    (From <$ string "from") <|>
    (Where <$ string "where") <?>
    "keyword (select, update, delete, from, where)"

-- consume spaces, then eat a word, a quoted string or parenthesis
parseOther :: Parser String
parseOther = many space <++>
    (("" <$ lookAhead (try parseKw)) <|> -- if there's a keyword, put it back!
     option "" ((parseParen <|> parseQuote <|> many1 (noneOf "'() \t")) <++> parseOther))

parseSqlToplevel :: Parser [(Keyword, String)]
parseSqlToplevel = many ((,) <$> parseKw <*> (space <++> parseOther)) <* eof

-- a single-quoted string; a backslash escapes the next character
parseQuote :: Parser String
parseQuote = char '\'' <++> inner <++> char '\'' where
    inner = many (noneOf "'\\") <++>
        option "" (char '\\' <++> anyChar <++> inner)

-- a balanced parenthesised group; parens inside quotes are not counted
parseParen :: Parser String
parseParen = char '(' <++> inner <++> char ')' where
    inner = many (noneOf "'()") <++>
        (parseQuote <++> inner <|> option "" (parseParen <++> inner))

I tried it with parseTest parseSqlToplevel "select ('a(sdf'())b". Cheers.
