Is it appropriate for a parser DCG to not be deterministic?

I am writing a parser for a query engine. My parser DCG query is not deterministic.

I will be using the parser in a relational manner, to both check and synthesize queries.

Is it appropriate for a parser DCG to not be deterministic?

In code:

If I want to be able to use query/2 both ways, does it require that

?- phrase(query, [q,u,e,r,y]).
true;
false.

or should I be able to obtain

?- phrase(query, [q,u,e,r,y]).
true.

nevertheless, given that the first snippet would require me to use it as such

?- bagof(X, phrase(query, [q,u,e,r,y]), [true]).
true.

when using it to check a formula?

The first question to ask yourself is whether your grammar is deterministic or, in the terminology of grammars, unambiguous. This is not asking whether your DCG is deterministic, but whether the grammar is unambiguous. That can be answered with basic parsing concepts; no DCG is needed to answer it. In other words, is there only one way to parse a valid input? The standard book for this is "Compilers: Principles, Techniques, & Tools" (WorldCat).

Now you are actually asking about three different uses for parsing.

  1. A recognizer.
  2. A parser.
  3. A generator.

If your grammar is unambiguous then

  1. For a recognizer the answer should only be true for valid input that can be parsed and false for invalid input.
  2. For the parser it should be deterministic as there is only one way to parse the input. The difference between a parser and a recognizer is that a recognizer only returns true or false and a parser will return something more, typically an abstract syntax tree.
  3. For the generator, it should be non-deterministic so that it can generate multiple results on backtracking.

Can all of this be done with one DCG? Yes. The three different ways depend on how you use the input and output of the DCG.

Here is an example with a very simple grammar.

The grammar is just an infix binary expression with one operator and two possible operands. The operator is (+) and the operands are either (1) or (2).

expr(expr(Operand_1,Operator,Operand_2)) -->
    operand(Operand_1),
    operator(Operator),
    operand(Operand_2).

operand(operand(1)) --> "1".
operand(operand(2)) --> "2".

operator(operator(+)) --> "+".

recognizer(Input) :-
    string_codes(Input,Codes),
    DCG = expr(_),
    phrase(DCG,Codes,[]).

parser(Input,Ast) :-
    string_codes(Input,Codes),
    DCG = expr(Ast),
    phrase(DCG,Codes,[]).

generator(Generated) :-
    DCG = expr(_),
    phrase(DCG,Codes,[]),
    string_codes(Generated,Codes).

:- use_module(library(plunit)).

:- begin_tests(expr).

recognizer_test_case_success("1+1").
recognizer_test_case_success("1+2").
recognizer_test_case_success("2+1").
recognizer_test_case_success("2+2").

test(recognizer,[ forall(recognizer_test_case_success(Input)) ] ) :-
    recognizer(Input).

recognizer_test_case_fail("2+3").

test(recognizer,[ forall(recognizer_test_case_fail(Input)), fail ] ) :-
    recognizer(Input).

parser_test_case_success("1+1",expr(operand(1),operator(+),operand(1))).
parser_test_case_success("1+2",expr(operand(1),operator(+),operand(2))).
parser_test_case_success("2+1",expr(operand(2),operator(+),operand(1))).
parser_test_case_success("2+2",expr(operand(2),operator(+),operand(2))).

test(parser,[ forall(parser_test_case_success(Input,Expected_ast)) ] ) :-
    parser(Input,Ast),
    assertion( Ast == Expected_ast).

parser_test_case_fail("2+3").

test(parser,[ forall(parser_test_case_fail(Input)), fail ] ) :-
    parser(Input,_).

test(generator,all(Generated == ["1+1","1+2","2+1","2+2"]) ) :-
    generator(Generated).

:- end_tests(expr).

The grammar is unambiguous and has only 4 valid strings, which are all unique.

The recognizer is deterministic and only returns true or false.
The parser is deterministic and returns a unique AST.
The generator is non-deterministic and returns all 4 valid unique strings on backtracking.

Example run of the test cases.

?- run_tests.
% PL-Unit: expr ........... done
% All 11 tests passed
true.

To expand a little on the comment by Daniel

As Daniel notes

1 + 2 + 3 

can be parsed as

(1 + 2) + 3 

or

1 + (2 + 3)

So 1+2+3 is an example that, as you said, "is specified by a recursive DCG", and as I noted, a common way out of the problem is to use parentheses to start a new context. Starting a new context is like getting a clean slate to start over again. If you are creating an AST, you just put the new context, the items between the parentheses, as a new subtree at the current node.
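The "new context" idea can be sketched as a small parse-only DCG. This is only an illustration under assumptions: the names add_expr//1, add_rest//2 and primary//1 are hypothetical, operands are single digits, and the AST is built from ordinary +/2 terms so that it can be inspected with write_canonical/1.

```prolog
% add_expr(-Ast)//: hypothetical parse-only grammar for single-digit
% operands, the + operator and parentheses.
add_expr(Ast) --> primary(Left), add_rest(Left, Ast).

% Fold each further "+ Operand" into the tree built so far.
% Accumulating on the left is what makes + left associative here.
add_rest(Left, Ast) --> "+", primary(Right), add_rest(Left+Right, Ast).
add_rest(Ast, Ast)  --> [].

% A primary is a digit or a parenthesized expression. The parentheses
% start a new context whose result becomes a subtree at the current node.
primary(N)   --> [C], { code_type(C, digit(N)) }.
primary(Ast) --> "(", add_expr(Ast), ")".

% ?- atom_codes('1+2+3', Cs), phrase(add_expr(Ast), Cs), write_canonical(Ast).
% +(+(1,2),3)
% ?- atom_codes('1+(2+3)', Cs), phrase(add_expr(Ast), Cs), write_canonical(Ast).
% +(1,+(2,3))
```

Because the associativity is fixed by the grammar and grouping is only overridden by explicit parentheses, each valid input has exactly one parse.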

With regard to write_canonical/1, this is also helpful, but be aware of the left and right associativity of operators. See Associative property.

e.g.

+ is left associative

?- write_canonical(1+2+3).
+(+(1,2),3)
true.

^ is right associative

?- write_canonical(2^3^4).
^(2,^(3,4))
true.

i.e.

2^3^4 = 2^(3^4) = 2^81 = 2417851639229258349412352

2^3^4 != (2^3)^4 = 8^4 = 4096

The point of this added info is to warn you that grammar design is full of hidden pitfalls. If you have not had a rigorous class in it and done some of it yourself, you could easily create a grammar that looks great and works great, and then years later is found to have a serious problem. While Python was not ambiguous AFAIK, it did have grammar issues, enough of them that many were fixed when Python 3 was created. So Python 3 is not backward compatible with Python 2 (differences). Yes, they have made changes and libraries to make it easier to use Python 2 code with Python 3, but the point is that the grammar could have used a bit more analysis when designed.

The only reason why code should be non-deterministic is that your question has multiple answers. In that case, you'd of course want your query to have multiple solutions. Even then, however, you'd like it to not leave a choice point after the last solution, if at all possible.

Here is what I mean:

"What is the smaller of two numbers?"

min_a(A, B, B) :- B < A.
min_a(A, B, A) :- A =< B.

So now you ask, "what is the smaller of 1 and 2" and the answer you expect is "1":

?- min_a(1, 2, Min).
Min = 1.

?- min_a(2, 1, Min).
Min = 1 ; % crap...
false.

?- min_a(2, 1, 2).
false.

?- min_a(2, 1, 1).
true ; % crap...
false.

So that's not bad code but I think it's still crap. This is why, for the smaller of two numbers, you'd use something like the min() function in SWI-Prolog.
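A minimal sketch of that alternative, assuming both arguments are numbers; min_b/3 is a hypothetical name chosen to sit next to min_a/3:

```prolog
% min_b(+A, +B, ?Min): hypothetical replacement for min_a/3.
% Arithmetic evaluation with is/2 has exactly one result, so no
% choice point is left behind.
min_b(A, B, Min) :-
    Min is min(A, B).

% ?- min_b(2, 1, Min).
% Min = 1.
%
% ?- min_b(2, 1, 1).
% true.
```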

Similarly, say you want to ask, "What are the even numbers between 1 and 10"; you write the query:

?- between(1, 10, X), X rem 2 =:= 0.
X = 2 ;
X = 4 ;
X = 6 ;
X = 8 ;
X = 10.

... and that's fine, but if you then ask for the numbers that are multiples of 3, you get:

?- between(1, 10, X), X rem 3 =:= 0.
X = 3 ;
X = 6 ;
X = 9 ;
false. % crap...

The "low-hanging fruit" are the cases where you as a programmer can see that there cannot be non-determinism, but for some reason your Prolog is not able to deduce that from the code you wrote. In most cases, you can do something about it.
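One way to "do something about it" for the min_a/3 case is to commit to a branch explicitly with if-then-else, so that the two mutually exclusive conditions are no longer spread over two clauses. This is a sketch under the assumption that both arguments are numbers; min_c/3 is a hypothetical name.

```prolog
% min_c(+A, +B, ?Min): hypothetical variant of min_a/3.
% The comparison is made once and the if-then-else commits to one
% branch, so nothing remains to be retried on backtracking.
min_c(A, B, Min) :-
    (   B < A
    ->  Min = B
    ;   Min = A
    ).

% ?- min_c(2, 1, Min).
% Min = 1.
```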

On to your actual question. If you can, write your code so that there is non-determinism only if there are multiple answers to the question you'll be asking. When you use a DCG for both parsing and generating, this sometimes means you end up with two code paths. It feels clumsy but it is easier to write, to read, to understand, and probably to make efficient. As a word of caution, take a look at this question. I can't know that for sure, but the problems that OP is running into are almost certainly caused by unnecessary non-determinism. What probably happens with larger inputs is that a lot of choice points are left behind, there is a lot of memory that cannot be reclaimed, a lot of processing time goes into bookkeeping, huge solution trees are traversed only to get (as expected) no solutions... you get the point.
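As an illustration of the "two code paths" idea, here is a minimal sketch of a non-terminal that relates a non-negative integer to its decimal digits. All names (natural//1, digits1//1, digits0//1, codes//1) are hypothetical, and the split on var/1 is just one possible way to separate parsing from generating.

```prolog
% natural(?N)//: hypothetical non-terminal with two explicit code paths.
natural(N) -->
    { var(N) }, !,                    % parsing path: N is not yet known
    digits1(Codes),
    { number_codes(N, Codes) }.
natural(N) -->
    { integer(N), N >= 0,             % generating path: N is known
      number_codes(N, Codes) },
    codes(Codes).

% One or more digit codes, taking as many as possible.
digits1([C|Cs]) --> [C], { code_type(C, digit) }, digits0(Cs).

digits0([C|Cs]) --> [C], { code_type(C, digit) }, !, digits0(Cs).
digits0([])     --> [].

% Emit a known list of codes.
codes([])     --> [].
codes([C|Cs]) --> [C], codes(Cs).

% ?- atom_codes('42', Cs), phrase(natural(N), Cs).    % parsing, N = 42
% ?- phrase(natural(42), Cs), atom_codes(A, Cs).      % generating, A = '42'
```

The cuts commit to the longest run of digits and to the chosen path, which is the kind of deliberate pruning of useless choice points discussed here.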

For examples of what I mean, you can take a look at the implementation of library(dcg/basics) in SWI-Prolog. Pay attention to several things:

  • The documentation is very explicit about what is deterministic, what isn't, and how non-determinism is supposed to be useful to the client code;
  • The use of cuts, where necessary, to get rid of choice points that are useless;
  • The implementation of number//1 (towards the bottom) that can "generate extract a number".

(Hint: use the primitives in this library when you write your own parser!)

I hope you find this unnecessarily long answer useful.
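A small sketch of what using those primitives might look like, assuming SWI-Prolog with library(dcg/basics) available; addition//2 is a hypothetical rule, while number//1 and blanks//0 come from the library:

```prolog
:- use_module(library(dcg/basics)).   % number//1, blanks//0

% addition(-A, -B)//: hypothetical rule for input such as "12 + 30".
% Number parsing and white-space skipping are left to the library.
addition(A, B) -->
    number(A), blanks, "+", blanks, number(B).

% ?- atom_codes('12 + 30', Cs), phrase(addition(A, B), Cs).
% A = 12,
% B = 30.
```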
