Using regex library to create lexical analyzer in C++?

I am trying to write an XML scanner in C++. Ideally I would like to use the regex library, as it would make the job much easier.

However, I'm a little stumped as to how to do it. First, I need to create a regular expression for each token in the language. I could use a map to store these regexes paired with the token names.
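As a sketch of that idea (the token names and patterns here are illustrative simplifications, not a complete XML grammar), the table might look like this. Note that an ordered vector of pairs, rather than a map, preserves matching priority, which matters when patterns overlap:

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Illustrative token table: pairs of token name and pattern.
// Listed in priority order; these patterns are simplified sketches,
// not a full XML grammar.
static const std::vector<std::pair<std::string, std::regex>> kTokens = {
    {"OPEN_TAG",  std::regex(R"(<[A-Za-z_][A-Za-z0-9._-]*)")},
    {"CLOSE_TAG", std::regex(R"(</[A-Za-z_][A-Za-z0-9._-]*\s*>)")},
    {"TAG_END",   std::regex(R"(/?>)")},
    {"ATTRIBUTE", std::regex(R"([A-Za-z_][A-Za-z0-9._-]*\s*=\s*"[^"]*")")},
    {"TEXT",      std::regex(R"([^<>]+)")},
};
```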

Next, I would open an input file and use an iterator to step through the strings in the file, matching them against the regexes. However, XML has no whitespace separating the strings.

So my question is: will this method even work? Also, how exactly does the regex library fit my needs? Is regex_match enough to do this in a foolproof way, so that my scanner can't be tricked?

I'm just trying to sketch out the process in my head so that I can start working on this. I wanted some input from others to see whether I'm thinking about the problem correctly.

I'd appreciate any thoughts on this. Thanks so much!

Lexical analysis usually proceeds by matching tokens sequentially, where each token corresponds to the longest possible match from a set of regular expressions. Since each match is anchored where the previous token ended, no searching is performed.
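A minimal sketch of this anchored, longest-match ("maximal munch") loop using std::regex might look like the following (the token rules are hypothetical and supplied by the caller; `std::regex_constants::match_continuous` anchors each attempt at the current position, so no searching is done):

```cpp
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct Token { std::string name, text; };

// Tokenize the whole input by repeatedly trying every rule at the
// current position (match_continuous = anchored match) and taking
// the longest match. On equal lengths, the earlier rule wins.
std::vector<Token> tokenize(
    const std::string& input,
    const std::vector<std::pair<std::string, std::regex>>& rules) {
  std::vector<Token> out;
  auto it = input.cbegin();
  while (it != input.cend()) {
    std::string best_name, best_text;
    for (const auto& [name, re] : rules) {
      std::smatch m;
      if (std::regex_search(it, input.cend(), m, re,
                            std::regex_constants::match_continuous) &&
          m.length(0) > static_cast<std::ptrdiff_t>(best_text.size())) {
        best_name = name;
        best_text = m.str(0);
      }
    }
    if (best_text.empty())
      throw std::runtime_error("no token matches at this position");
    out.push_back({best_name, best_text});
    it += best_text.size();  // resume where this token ended
  }
  return out;
}
```

Because the comparison is strictly greater-than, a rule earlier in the table takes priority over a later rule that matches the same length, which is the usual lexer convention.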

Here I use the word "token" slightly loosely: whitespace and comments are also matched as tokens, but in most programming languages they are simply ignored after being recognized. A conformant XML tokenizer would need to recognize them as real tokens, though, so the usage would be precise for your problem domain.

Rather than immersing yourself in a sea of annoying details, you might want to learn about (f)lex, which efficiently implements this algorithm given a collection of regular expressions. It also takes care of buffer handling and some other details, letting you concentrate on understanding the nature of the lexical analysis process.

There is a tool for this, called RE/flex, that generates scanners:

https://sourceforge.net/projects/re-flex

The generated scanners use regex engines such as Boost.Regex. Boost.Regex is used via an API that handles different types of input, so there is some additional C++ code involved; it is not the bare-bones Boost.Regex API calls you may be looking for.

The examples included with RE/flex contain an XML scanner in C++ that may help you get started. RE/flex also supports the UTF-8 encoding, which you will need to scan XML properly.
