
How do I parse a complex file format in Delphi? (Not CSV, XML, etc.)

Original post: 2010-07-20 21:43:12 5 4 delphi/ file/ parsing/ format/ tokenize

It's been a few years since I've had to parse any files that were harder than CSV or XML, so I am out of practice. I've been given the task of parsing a file format called NeXus in a Delphi application.

The problem is I just don't know where to start. Do I use a tokenizer, regex, etc.? Maybe even a tutorial is what I need at this point.

4 answers

Have a look at GOLD Parser. It's a meta-parsing system that allows you to define a formal grammar for a language or file format. It creates a parsing rules file which you feed into a tokenizer, together with your input file, and the tokenizer builds a syntax tree in memory.

There's a Delphi implementation of the tokenizer available on the website. It makes parsing a lot easier, since the lexing and tokenizing are already taken care of for you. All you have to worry about is defining the tokens in a formal grammar and then interpreting them once they've been parsed.

Check this out. It's commercial, but it looks like a fun toy:

http://dpg.zenithlab.com/

But actually, for NeXus you do not need a complicated parser.

A bit of position-checking code, some string splitting and parenthesis counting, and you've got it written.

I would parse it using a simple token-at-a-time parser like this:

Load the file into a TStringList.
For each line, grab one token at a time to determine the line type.
Have an enumerated type for this line type.
The first valid non-blank line should be detected to be a valid #nexus tag.
Next comes the header area (mostly skipped, it looks like).
begin is the first keyword on the line.
The following lines inside the begin block look almost like a DOS command with its command-line parameters: they are separated by spaces and end with semicolons. Pretty much like Pascal, but with parentheses.
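The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names TLineKind, FirstWord, ClassifyLine and StartsWithNexusTag are hypothetical, the sample lines are made up, and the sketch assumes one command per line, which a real NeXus file may not honor.

```pascal
program ClassifyNexusLines;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  // Hypothetical line kinds; extend as the format requires.
  TLineKind = (lkBlank, lkNexusTag, lkBegin, lkEnd, lkCommand);

function IsDelim(C: Char): Boolean;
begin
  Result := (C = ' ') or (C = #9) or (C = ';');
end;

// Returns the first word of a line, upper-cased ('' for a blank line).
function FirstWord(const Line: string): string;
var
  S: string;
  I: Integer;
begin
  S := Trim(Line);
  I := 1;
  while (I <= Length(S)) and not IsDelim(S[I]) do
    Inc(I);
  Result := UpperCase(Copy(S, 1, I - 1));
end;

// Decides the line type from the first token only.
function ClassifyLine(const Line: string): TLineKind;
var
  W: string;
begin
  W := FirstWord(Line);
  if W = '' then
    Result := lkBlank
  else if W = '#NEXUS' then
    Result := lkNexusTag
  else if W = 'BEGIN' then
    Result := lkBegin
  else if W = 'END' then
    Result := lkEnd
  else
    Result := lkCommand;
end;

// True when the first non-blank line is the #NEXUS tag.
function StartsWithNexusTag(Lines: TStrings): Boolean;
var
  I: Integer;
  Kind: TLineKind;
begin
  Result := False;
  for I := 0 to Lines.Count - 1 do
  begin
    Kind := ClassifyLine(Lines[I]);
    if Kind <> lkBlank then
    begin
      Result := Kind = lkNexusTag;
      Exit;
    end;
  end;
end;

var
  Lines: TStringList;
  I: Integer;
begin
  Lines := TStringList.Create;
  try
    // In a real program: Lines.LoadFromFile(FileName);
    Lines.Add('#NEXUS');
    Lines.Add('');
    Lines.Add('begin taxa;');
    Lines.Add('  dimensions ntax=3;');
    Lines.Add('end;');
    WriteLn('Has #NEXUS tag: ', StartsWithNexusTag(Lines));
    for I := 0 to Lines.Count - 1 do
      WriteLn(Ord(ClassifyLine(Lines[I])), ' ', Lines[I]);
  finally
    Lines.Free;
  end;
end.
```

Classifying by the first token keeps the main loop a plain case statement over TLineKind, with the per-command parsing pushed into separate helpers.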

For the above I would write myself a little set of helpers, and eventually one of the things I might need is a little token-splitting function like this:

function GetToken(var inputString: String; out outputToken: String; const Separators: TStrings; Keywords: TStrings; ParenFlag: Boolean): Boolean;

GetToken would return True when it was able to find and return a token string from inputString. It would skip any leading whitespace and terminate when it finds a separator. Separators are items like space or comma.
ParenFlag = True would mean that the next token I get should be an entire parenthesized list of items. Once I have the whole parenthesized list, such as (((a,b),(c,d),(e,f))), I would call another function that unpacks the content of that list into some data structure for the lists/arrays.
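A minimal sketch of such a GetToken might look like the following. It is a simplified, hypothetical variant rather than the signature given above: Separators is a plain string of separator characters, the Keywords parameter is left out, and ParenFlag here means "a token that starts with '(' is read as a whole parenthesized list".

```pascal
program GetTokenSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Simplified, hypothetical GetToken. On success the token (and the
// separator that ended it, if any) is removed from the front of Input.
function GetToken(var Input: string; out Token: string;
  const Separators: string; ParenFlag: Boolean): Boolean;
var
  I, Depth: Integer;
begin
  Token := '';
  Input := TrimLeft(Input);            // skip leading whitespace
  Result := Input <> '';
  if not Result then
    Exit;

  if ParenFlag and (Input[1] = '(') then
  begin
    // Read a whole parenthesized list by counting nesting depth.
    Depth := 0;
    I := 0;
    repeat
      Inc(I);
      if Input[I] = '(' then
        Inc(Depth)
      else if Input[I] = ')' then
        Dec(Depth);
    until (Depth = 0) or (I = Length(Input));
    if Depth <> 0 then
    begin
      Result := False;                 // unbalanced; Input is left as it was
      Exit;
    end;
    Token := Copy(Input, 1, I);
    Delete(Input, 1, I);
  end
  else
  begin
    // Read up to, but not including, the next separator character.
    I := 1;
    while (I <= Length(Input)) and (Pos(Input[I], Separators) = 0) do
      Inc(I);
    Token := Copy(Input, 1, I - 1);
    Delete(Input, 1, I);               // also drops the separator itself
  end;
end;

var
  Line, Tok: string;
begin
  Line := 'matrix (((a,b),(c,d),(e,f))) done';
  while GetToken(Line, Tok, ' ,', True) do
    WriteLn('[', Tok, ']');
end.
```

The sketch discards the separator that ends a token. A real parser would probably want to report a semicolon as a token of its own, so the caller knows where a command ends.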

I do not recommend the big parser engine, though writing a BNF grammar first, before you write the parser, will help you write the code. There's nothing so brutal here that you can't parse it by hand.

Are you going to be expected to do queries/transforms on this? Do you think you need to convert it into JSON or XML in order to work further with it?

In addition to Mason's very nice answer: there is a great little class in Delphi that is often underappreciated, and one that you can learn a really nice technique from. That's the PageProducer class.

Have a look at the way it parses HTML and surfaces events for things like finding tags, attributes, etc. I'm not saying use the PageProducer (because you won't be able to for NeXus), but it's a very simple, elegant and powerful technique.
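To illustrate the event-surfacing idea outside of PageProducer, here is a small sketch. TCommandScanner, TCommandEvent and TCollector are hypothetical names invented for this example, not part of the VCL, and splitting purely on semicolons is an assumption that ignores quoting and comments.

```pascal
program EventScannerSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  // Hypothetical event type: fired once per semicolon-terminated command.
  TCommandEvent = procedure(Sender: TObject; const Command: string) of object;

  // The scanner knows how to split the text, not what the commands mean.
  TCommandScanner = class
  private
    FOnCommand: TCommandEvent;
  public
    procedure Scan(const Text: string);
    property OnCommand: TCommandEvent read FOnCommand write FOnCommand;
  end;

  // The consumer holds the interpretation logic; here it only collects.
  TCollector = class
  public
    Commands: TStringList;
    constructor Create;
    destructor Destroy; override;
    procedure HandleCommand(Sender: TObject; const Command: string);
  end;

procedure TCommandScanner.Scan(const Text: string);
var
  Rest, Cmd: string;
  P: Integer;
begin
  Rest := Text;
  P := Pos(';', Rest);
  while P > 0 do
  begin
    Cmd := Trim(Copy(Rest, 1, P - 1));
    Delete(Rest, 1, P);
    if (Cmd <> '') and Assigned(FOnCommand) then
      FOnCommand(Self, Cmd);           // surface the command to the consumer
    P := Pos(';', Rest);
  end;
end;

constructor TCollector.Create;
begin
  inherited Create;
  Commands := TStringList.Create;
end;

destructor TCollector.Destroy;
begin
  Commands.Free;
  inherited Destroy;
end;

procedure TCollector.HandleCommand(Sender: TObject; const Command: string);
begin
  Commands.Add(Command);
end;

var
  Scanner: TCommandScanner;
  Collector: TCollector;
  I: Integer;
begin
  Scanner := TCommandScanner.Create;
  Collector := TCollector.Create;
  try
    Scanner.OnCommand := Collector.HandleCommand;
    Scanner.Scan('begin taxa; dimensions ntax=3; end;');
    for I := 0 to Collector.Commands.Count - 1 do
      WriteLn(Collector.Commands[I]);
  finally
    Collector.Free;
    Scanner.Free;
  end;
end.
```

The appeal of this design is that the scanning loop never changes when the meaning of a command does; only the event handler grows.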

I haven't yet found a text format that a state machine won't parse. Add in recursion to run down nests in trees. State machines are easily written, relatively quick parsing engines that can be built for virtually any patterned text file, and often that is easier than using a scripting language to boot. I have custom ones written for HTML, XML, HL7 and a variety of medical EDI formats.
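As a rough sketch of a state machine with recursion, the following parses a nested parenthesized list into a tree. TNode, TListState, ParseList and Describe are hypothetical names, and the sketch assumes well-formed input with no whitespace before an opening parenthesis and no quoting.

```pascal
program NestedListSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Contnrs;

type
  TNode = class
  public
    Name: string;              // leaf label; empty for an inner list
    Children: TObjectList;     // owns the child TNode instances
    constructor Create;
    destructor Destroy; override;
  end;

  TListState = (stExpectItem, stAfterItem, stDone);

constructor TNode.Create;
begin
  inherited Create;
  Children := TObjectList.Create(True);
end;

destructor TNode.Destroy;
begin
  Children.Free;
  inherited Destroy;
end;

// On entry S[P] = '('; on exit P is just past the matching ')'.
function ParseList(const S: string; var P: Integer): TNode;
var
  State: TListState;
  Start: Integer;
  Leaf: TNode;
begin
  Result := TNode.Create;
  try
    Inc(P);                                   // consume '('
    State := stExpectItem;
    while (State <> stDone) and (P <= Length(S)) do
      case State of
        stExpectItem:
          begin
            if S[P] = '(' then
              Result.Children.Add(ParseList(S, P))   // recurse into the nest
            else
            begin
              Start := P;
              while (P <= Length(S)) and (S[P] <> ',') and
                    (S[P] <> ')') and (S[P] <> '(') do
                Inc(P);
              Leaf := TNode.Create;
              Leaf.Name := Trim(Copy(S, Start, P - Start));
              Result.Children.Add(Leaf);
            end;
            State := stAfterItem;
          end;
        stAfterItem:
          begin
            if S[P] = ',' then
              State := stExpectItem
            else if S[P] = ')' then
              State := stDone
            else
              raise Exception.CreateFmt('Unexpected "%s" at position %d',
                [S[P], P]);
            Inc(P);
          end;
      end;
    if State <> stDone then
      raise Exception.Create('Unbalanced parentheses');
  except
    Result.Free;
    raise;
  end;
end;

// Renders the tree with square brackets, for inspection only.
function Describe(Node: TNode): string;
var
  I: Integer;
begin
  if Node.Children.Count = 0 then
  begin
    Result := Node.Name;
    Exit;
  end;
  Result := '[';
  for I := 0 to Node.Children.Count - 1 do
  begin
    if I > 0 then
      Result := Result + ' ';
    Result := Result + Describe(TNode(Node.Children[I]));
  end;
  Result := Result + ']';
end;

var
  Root: TNode;
  P: Integer;
begin
  P := 1;
  Root := ParseList('(((a,b),(c,d),(e,f)))', P);
  try
    WriteLn(Describe(Root));   // prints [[[a b] [c d] [e f]]]
  finally
    Root.Free;
  end;
end.
```

The two working states only track whether an item or a delimiter is expected next; all the nesting is handled by the recursive call, so the machine stays the same size however deep the tree gets.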