
How can lexing efficiency be improved?

In parsing a large 3-gigabyte file with a DCG, efficiency is of importance.

The current version of my lexer mostly uses the or-predicate ;/2, but I have read that indexing can help.

Indexing is a technique used to quickly select candidate clauses of a predicate for a specific goal. In most Prolog systems, indexing is done (only) on the first argument of the head. If this argument is instantiated to an atom, integer, float or compound term with functor, hashing is used to quickly select all clauses where the first argument may unify with the first argument of the goal. SWI-Prolog supports just-in-time and multi-argument indexing. See section 2.18.
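
For a minimal illustration of the idea (my own sketch, not from the SWI-Prolog manual): when a goal's first argument is bound at the call, clause selection becomes a hash lookup rather than a sequential scan, and no choice point is left behind.

% Sketch: with the first argument bound, SWI-Prolog's first-argument
% indexing hashes on 'green' and jumps straight to the matching clause;
% the red and blue clauses are never tried and no choice point remains.
color_code(red,   1).
color_code(green, 2).
color_code(blue,  3).

?- color_code(green, N).
N = 2.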

Can someone give an example of using indexing for lexing and possibly explain how it improves efficiency?


Details

Note: I changed some of the names before copying the source code into this question. If you find a mistake, feel free to edit it here or leave me a comment and I will gladly fix it.

Currently my lexer/tokenizer (based on mzapotoczny/prolog-interpreter parser.pl) is this:

% N.B.
% Since the lexer uses "" for values, the double_quotes flag has to be set to `chars`.
% If the double_quotes flag is set to `codes`, the values written with "" will not be matched.

:- use_module(library(pio)). 
:- use_module(library(dcg/basics)).
:- set_prolog_flag(double_quotes,chars).

lexer(Tokens) -->
   white_space,
   (
       (  ":",       !, { Token = tokColon }
      ;  "(",       !, { Token = tokLParen }
      ;  ")",       !, { Token = tokRParen }
      ;  "{",       !, { Token = tokLMusta}
      ;  "}",       !, { Token = tokRMusta}
      ;  "\\",      !, { Token = tokSlash}
      ;  "->",      !, { Token = tokImpl}
      ;  "+",       !, { Token = tokPlus }
      ;  "-",       !, { Token = tokMinus }
      ;  "*",       !, { Token = tokTimes }
      ;  "=",       !, { Token = tokEqual }
      ;  "<",       !, { Token = tokLt }
      ;  ">",       !, { Token = tokGt }
      ;  "_",       !, { Token = tokUnderscore }
      ;  ".",       !, { Token = tokPeriod }
      ;  "/",       !, { Token = tokForwardSlash }
      ;  ",",       !, { Token = tokComma }
      ;  ";",       !, { Token = tokSemicolon }
      ;  digit(D),  !,
            number(D, N),
            { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
            {  member((Id, Token), [ (div, tokDiv),
                                     (mod, tokMod),
                                     (where, tokWhere)]),
               !
            ;  Token = tokVar(Id)
            }
      ;  [_],
            { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;  [],
         { Tokens = [] }
   ).

white_space -->
   [Char], { code_type(Char, space) }, !, white_space.
white_space -->
    "--", whole_line, !, white_space.
white_space -->
   [].

whole_line --> "\n", !.
whole_line --> [_], whole_line.

digit(D) -->
   [D],
      { code_type(D, digit) }.

digits([D|T]) -->
   digit(D),
   !,
   digits(T).
digits([]) -->
   [].

number(D, N) -->
   digits(Ds),
      { number_chars(N, [D|Ds]) }.

letter(L) -->
   [L], { code_type(L, alpha) }.

alphanum([A|T]) -->
   [A], { code_type(A, alnum) }, !, alphanum(T).
alphanum([]) -->
   [].


identifier(L, Id) -->
   alphanum(As),
      { atom_codes(Id, [L|As]) }.

Here are some helper predicates used for development and testing. 以下是一些用于开发和测试的辅助谓词。

read_file_for_lexing_and_user_review(Path) :-
    open(Path,read,Input),
    read_input_for_user_review(Input), !,
    close(Input).

read_file_for_lexing_and_performance(Path,Limit) :-
    open(Path,read,Input),
    read_input_for_performance(Input,0,Limit), !,
    close(Input).

read_input(Input) :-
    at_end_of_stream(Input).

read_input(Input) :-
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line(Line),
    read_input(Input).

read_input_for_user_review(Input) :-
    at_end_of_stream(Input).

read_input_for_user_review(Input) :-
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line_for_user_review(Line),
    nl,
    print('Press spacebar to continue or any other key to exit: '),
    get_single_char(Key),
    process_user_continue_or_exit_key(Key,Input).

read_input_for_performance(_Input,Count,Limit) :-
    Count >= Limit.

read_input_for_performance(Input,_,_) :-
    at_end_of_stream(Input).

read_input_for_performance(Input,Count0,Limit) :-
    % print(Count0),
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line(Line),
    Count is Count0 + 1,
    read_input_for_performance(Input,Count,Limit).

process_user_continue_or_exit_key(32,Input) :-  % space bar
    nl, nl,
    read_input_for_user_review(Input).

process_user_continue_or_exit_key(Key,_Input) :-
    Key \= 32.

lex_line_for_user_review(Line) :-
    lex_line(Line,TokList),
    print(Line),
    nl,
    print(TokList),
    nl.

lex_line(Line,TokList) :-
    string_chars(Line,Code_line),
    phrase(lexer(TokList),Code_line).

lex_line(Line) :-
    string_chars(Line,Code_line),
    phrase(lexer(_TokList),Code_line).

read_user_input_for_lexing_and_user_review :-
    print('Enter a line to parse or just Enter to exit: '),
    nl,
    read_string(user, "\n", "\r", _, String),
    nl,
    lex_line_for_user_review(String),
    nl,
    continue_user_input_for_lexing_and_user_review(String).

continue_user_input_for_lexing_and_user_review(String) :-
    string_length(String,N),
    N > 0,
    read_user_input_for_lexing_and_user_review.

continue_user_input_for_lexing_and_user_review(String) :-
    string_length(String,0).

read_user_input_for_lexing_and_user_review/0 allows a user to enter a string at the terminal for lexing and to review the tokens.

read_file_for_lexing_and_user_review/1 reads a file for lexing and reviews the tokens one line at a time.

read_file_for_lexing_and_performance/2 reads a file for lexing with a limit on the number of lines to lex. It is for gathering basic performance statistics to measure efficiency, and is meant to be used with time/1.
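
For example, a session using these helpers might look like this (the file name below is a placeholder):

?- time(read_file_for_lexing_and_performance('sample.txt', 10000)).

?- read_file_for_lexing_and_user_review('sample.txt').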

Solution:

You should replace the following:

lexer(Tokens) -->
   white_space,
   (
      (  ":",       !, { Token = tokColon }
      ;  "(",       !, { Token = tokLParen }
      ;  ")",       !, { Token = tokRParen }
      ;  "{",       !, { Token = tokLMusta}
      ;  "}",       !, { Token = tokRMusta}
      ;  "\\",      !, { Token = tokSlash}
      ;  "->",      !, { Token = tokImpl}
      ;  "+",       !, { Token = tokPlus }
      ;  "-",       !, { Token = tokMinus }
      ;  "*",       !, { Token = tokTimes }
      ;  "=",       !, { Token = tokEqual }
      ;  "<",       !, { Token = tokLt }
      ;  ">",       !, { Token = tokGt }
      ;  "_",       !, { Token = tokUnderscore }
      ;  ".",       !, { Token = tokPeriod }
      ;  "/",       !, { Token = tokForwardSlash }
      ;  ",",       !, { Token = tokComma }
      ;  ";",       !, { Token = tokSemicolon }
      ;  digit(D),  !,
            number(D, N),
            { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
            {  member((Id, Token), [ (div, tokDiv),
                                     (mod, tokMod),
                                     (where, tokWhere)]),
               !
            ;  Token = tokVar(Id)
            }
      ;  [_],
            { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;  [],
         { Tokens = [] }
   ).

with

lexer(Tokens) -->
   white_space,
   (
      (
         op_token(Token), ! % replace the blindly searched ;/2 chain with a call to the new predicate op_token//1, whose clauses get indexed access on the first argument in the standard Prolog way
      ;
         digit(D),  !, number(D, N),
         { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
         {  member((Id, Token), [ (div, tokDiv),
                                 (mod, tokMod),
                                 (where, tokWhere)]),
            !
      ;  Token = tokVar(Id)
         }
      ;  [_],
         { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;
      [],
      { Tokens = [] }
   ).

%%%
op_token(tokColon)      --> ":".
op_token(tokLParen)     --> "(".
op_token(tokRParen)     --> ")".
op_token(tokLMusta)     --> "{".
op_token(tokRMusta)     --> "}".
op_token(tokBackSlash)  --> "\\".
op_token(tokImpl)       --> "->".
op_token(tokPlus)       --> "+".
op_token(tokMinus)      --> "-".
op_token(tokTimes)      --> "*".
op_token(tokEqual)      --> "=".
op_token(tokLt)         --> "<".
op_token(tokGt)         --> ">".
op_token(tokUnderscore) --> "_".
op_token(tokPeriod)     --> ".".
op_token(tokSlash)      --> "/".
op_token(tokComma)      --> ",".
op_token(tokSemicolon)  --> ";".
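
As a quick sanity check of the replacement (my own example, not part of the original answer; it assumes the double_quotes flag is still set to chars, as in the question):

?- phrase(lexer(Ts), ['x', ' ', '+', ' ', '1']).
Ts = [tokVar(x), tokPlus, tokNumber(1)].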

Edit by Guy Coder

I ran a test using the example data posted in the question, loaded into a list where each item was one line of the data converted to character codes. Then, under time/1, the lexer was called on each item in the list, and the test over the whole list was repeated 10,000 times. The data was loaded into a list and converted to character codes before time/1 so that those steps did not skew the results. Each of these runs was repeated 5 times to check the consistency of the results.

In the following runs, for all of the different versions, the lexer was extended to cover all of the 7-bit ASCII characters, which significantly increased the number of cases for special characters.

The version of Prolog used for the following was SWI-Prolog 8.0.

For the version in the question:

Version: 1

:- set_prolog_flag(double_quotes,chars).

% 694,080,002 inferences, 151.141 CPU in 151.394 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 150.813 CPU in 151.059 seconds (100% CPU, 4602271 Lips)
% 694,080,001 inferences, 152.063 CPU in 152.326 seconds (100% CPU, 4564439 Lips)
% 694,080,001 inferences, 151.141 CPU in 151.334 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 151.875 CPU in 152.139 seconds (100% CPU, 4570074 Lips)

For the version as posted above in this answer:

Version: 2

:- set_prolog_flag(double_quotes,chars).

% 773,260,002 inferences, 77.469 CPU in 77.543 seconds (100% CPU, 9981573 Lips)
% 773,260,001 inferences, 77.344 CPU in 77.560 seconds (100% CPU, 9997705 Lips)
% 773,260,001 inferences, 77.406 CPU in 77.629 seconds (100% CPU, 9989633 Lips)
% 773,260,001 inferences, 77.891 CPU in 77.967 seconds (100% CPU, 9927511 Lips)
% 773,260,001 inferences, 78.422 CPU in 78.644 seconds (100% CPU, 9860259 Lips)

Version 2 gives a dramatic improvement over Version 1 by using indexing.

In doing further research on the code, looking at op_token, which is a DCG and thus has two hidden variables for implicitly passing around the state representation, listing/1 showed:

op_token(tokUnderscore,['_'|A], A).

Notice that the first parameter is not the character being searched for, and that in this answer the indexing code is written as

c_digit(0'0,0).

where the first parameter is the character being searched for and the second parameter is the result.

So change this

op_token(Token), !

to this

[S], { special_character_indexed(S,Token) }

with indexed clauses such as

special_character_indexed( ';' ,tokSemicolon).
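
For completeness, here is a sketch of what the full indexed table might look like, reconstructed from the op_token//1 clauses above (the answer shows only the one clause; note that the two-character token "->" cannot be resolved by a single-character lookup and still needs separate handling):

% Reconstructed from op_token//1 above. The character read is now the
% first argument, so SWI-Prolog selects the clause by a hash lookup.
special_character_indexed(':', tokColon).
special_character_indexed('(', tokLParen).
special_character_indexed(')', tokRParen).
special_character_indexed('{', tokLMusta).
special_character_indexed('}', tokRMusta).
special_character_indexed('\\', tokBackSlash).
special_character_indexed('+', tokPlus).
special_character_indexed('-', tokMinus).
special_character_indexed('*', tokTimes).
special_character_indexed('=', tokEqual).
special_character_indexed('<', tokLt).
special_character_indexed('>', tokGt).
special_character_indexed('_', tokUnderscore).
special_character_indexed('.', tokPeriod).
special_character_indexed('/', tokSlash).
special_character_indexed(',', tokComma).
special_character_indexed(';', tokSemicolon).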


Version: 3

:- set_prolog_flag(double_quotes,chars).

% 765,800,002 inferences, 74.125 CPU in 74.348 seconds (100% CPU, 10331197 Lips)
% 765,800,001 inferences, 74.766 CPU in 74.958 seconds (100% CPU, 10242675 Lips)
% 765,800,001 inferences, 74.734 CPU in 74.943 seconds (100% CPU, 10246958 Lips)
% 765,800,001 inferences, 74.828 CPU in 75.036 seconds (100% CPU, 10234120 Lips)
% 765,800,001 inferences, 74.547 CPU in 74.625 seconds (100% CPU, 10272731 Lips)

Version 3 gives a slightly but consistently better result than Version 2.

Lastly, just changing the double_quotes flag to atom, as noted in a comment by AntonDanilov:

Version: 4

:- set_prolog_flag(double_quotes,atom).

% 765,800,003 inferences, 84.234 CPU in 84.539 seconds (100% CPU, 9091300 Lips)
% 765,800,001 inferences, 74.797 CPU in 74.930 seconds (100% CPU, 10238396 Lips)
% 765,800,001 inferences, 75.125 CPU in 75.303 seconds (100% CPU, 10193677 Lips)
% 765,800,001 inferences, 75.078 CPU in 75.218 seconds (100% CPU, 10200042 Lips)
% 765,800,001 inferences, 75.031 CPU in 75.281 seconds (100% CPU, 10206414 Lips)

Version 4 is almost the same as Version 3.

Just looking at the CPU numbers, using indexing is faster, e.g. (Version 1) 151.875 vs. (Version 3) 74.547.

One thing it means is that this is silly code:

token(T) -->
    ( "1", !, { T = one }
    ; "2", !, { T = two }
    ; "3", !, { T = three }
    ).

This is less silly code:

token(T) --> one_two_three(T).

one_two_three(one) --> "1".
one_two_three(two) --> "2".
one_two_three(three) --> "3".

But still not so good. Maybe better:

token(T) --> [X], { one_two_three(X, T) }.

one_two_three(0'1, one).
one_two_three(0'2, two).
one_two_three(0'3, three).

The last example also starts to look silly, but remember that now you have indexing on the first argument. You read once, no choice point, no backtracking.
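
To see it in action (my sketch; it assumes double_quotes is set to codes, so that "2" denotes a list of character codes matching the 0'2-style facts):

?- set_prolog_flag(double_quotes, codes).
true.

?- phrase(token(T), "2").
T = two.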

But if you really want to know how to write efficient code, you need to measure where the time and space go. Have you measured?

But if you really want to know how to fix it, maybe read "The Craft of Prolog". I do not understand all of this book, but I remember it had a big section on DCGs.

But if you really want to parse large files in such formats, maybe look for existing libraries in other languages; they might be much faster than the fastest Prolog.
