DSL for Text Parsing

I have a set of semi-structured text documents from a particular domain (accounting reports), and they are all very similar in content. However, the data are laid out differently in each document template.

It was fairly easy to write some regexes and get the data I wanted, but this has to be done again for every new document layout.

I want to build a generic parser that receives a script describing how to read the accounting report of a particular layout, so that for every new layout all I need to do is write a new script, which is simpler than writing a lot of regexes.

Something like this:

Parsing script:

declare collection_name {
  date,
  description,
  amount
}

get customer_name from line 3
get account_id from "AccountID <number>"

read data as <collection_name> from <pattern> until <pattern>

Please give me any clue on where to start, what to read about it, or whether you have already seen something like this. I would really appreciate any help.

Building a DSL is not easy, especially with a rich syntax like the one you proposed, so I assume you are ready :)

The pipeline is:

Script -> Compiler -> PHP code for specific template

Then you use the generated PHP code to get the data:

TEXT -> PHP code for that template -> data(structured JSON,XML,...)
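To make the second arrow concrete, here is a minimal sketch of what the per-template code could do for the two `get` statements in the question. It is written in Python purely for illustration (the answer targets PHP), and the function name `extract`, the sample text, and the reading of `<number>` as a run of digits are all assumptions, not something the question specifies.

```python
import json
import re


def extract(text: str) -> dict:
    """Hypothetical extractor for one layout, standing in for generated code."""
    lines = text.splitlines()

    # get customer_name from line 3  (assumes line numbers are 1-based)
    customer_name = lines[2].strip() if len(lines) >= 3 else None

    # get account_id from "AccountID <number>"  (assumes <number> means digits)
    match = re.search(r"AccountID\s+(\d+)", text)
    account_id = match.group(1) if match else None

    return {"customer_name": customer_name, "account_id": account_id}


# Made-up sample report, only to show the shape of the result.
sample = "ACME Bank\nStatement\nJohn Smith\nAccountID 12345\n"
print(json.dumps(extract(sample)))
```

A compiler for the DSL would emit one such function per script, so a new layout only needs a new script.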

So, to build a compiler, you need to understand the flow:

Script -> Lexer(Tokenizer) -> Parser -> AST/CFG -> PHP code generation

Definitions https://stackoverflow.com/a/380487/877594

  • Tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines).

  • Lexer is basically a tokenizer, but it usually attaches extra context to the tokens -- this token is a number, that token is a string literal, this other token is an equality operator.

  • Parser takes the stream of tokens from the lexer and turns it into an abstract syntax tree representing the (usually) program represented by the original text.
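As a rough illustration of the lexer step, the sketch below tokenizes the script syntax from the question using one combined regular expression. It is in Python for illustration only; the token kinds (`STRING`, `PLACEHOLDER`, `KEYWORD`, ...) and the keyword list are assumptions inferred from the sample script, not a defined grammar.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# Illustrative token kinds; order matters because the first alternative wins.
TOKEN_SPEC = [
    ("STRING", r'"[^"\n]*"'),        # "AccountID <number>"
    ("PLACEHOLDER", r"<\w+>"),       # <pattern>, <collection_name>
    ("NUMBER", r"\d+"),
    ("KEYWORD", r"\b(?:declare|get|from|line|read|data|as|until)\b"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("PUNCT", r"[{},]"),
    ("SKIP", r"\s+"),                # whitespace separates tokens
    ("ERROR", r"."),                 # anything else is a lexical error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(script: str) -> Iterator[Token]:
    """Yield typed tokens, skipping whitespace."""
    for m in MASTER.finditer(script):
        kind = m.lastgroup
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise SyntaxError(f"unexpected {m.group()!r} at offset {m.start()}")
        yield Token(kind, m.group())


for token in tokenize("get customer_name from line 3"):
    print(token.kind, token.text)
```

The extra context the lexer attaches is just the `kind` field; the parser can then match on kinds instead of re-inspecting the raw text.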

Abstract syntax tree http://en.wikipedia.org/wiki/Abstract_syntax_tree

A tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code. The syntax is "abstract" in not representing every detail appearing in the real syntax. For instance, grouping parentheses are implicit in the tree structure, and a syntactic construct like an if-condition-then expression may be denoted by means of a single node with two branches.

ASTs are a good fit for expressions rather than instructions, so they matter mainly if you are considering expressions in your DSL.

Control flow graph http://en.wikipedia.org/wiki/Control_flow_graph

A representation, using graph notation, of all paths that might be traversed through a program during its execution.

Each node is an instruction object (declare, get, read, ...) with attributes, e.g.:

get {
    target: customer_name,
    from: line {n: 3}
}
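One possible way to represent such instruction objects, sketched in Python for illustration: small data classes for the node and its source, plus a parser for the `get` statement only. The class names (`Get`, `LineSource`, `PatternSource`) and the function `parse_get` are hypothetical, and the node's `from` attribute is renamed `source` because `from` is a reserved word in Python.

```python
import re
from dataclasses import dataclass
from typing import Union


@dataclass
class LineSource:
    n: int            # line number as written in the script


@dataclass
class PatternSource:
    pattern: str      # raw pattern text, e.g. 'AccountID <number>'


@dataclass
class Get:
    target: str
    source: Union[LineSource, PatternSource]   # the node's "from" attribute


def parse_get(statement: str) -> Get:
    """Parse 'get <name> from line <n>' or 'get <name> from "<pattern>"'."""
    # Split into quoted strings and bare words.
    words = re.findall(r'"[^"]*"|\S+', statement)
    if len(words) < 4 or words[0] != "get" or words[2] != "from":
        raise SyntaxError(f"expected 'get <name> from ...': {statement!r}")
    target, rest = words[1], words[3:]
    if len(rest) == 2 and rest[0] == "line" and rest[1].isdigit():
        return Get(target, LineSource(int(rest[1])))
    if len(rest) == 1 and len(rest[0]) >= 2 and rest[0][0] == rest[0][-1] == '"':
        return Get(target, PatternSource(rest[0][1:-1]))
    raise SyntaxError(f"unsupported source in: {statement!r}")


print(parse_get("get customer_name from line 3"))
print(parse_get('get account_id from "AccountID <number>"'))
```

A list of such nodes, in script order, is enough to drive code generation for a straight-line script; a graph only becomes necessary once the DSL gains branching or loops.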

Building

PHP is a very poor choice, because it has no quality libraries for building lexers and parsers comparable to Flex/Bison in C/C++. The question "Flex/Bison-like functionality within PHP" lists some tools, but I don't recommend them.

I suggest that you build it yourself:
