
Building a Lexical Analyzer in Java

I am presently learning lexical analysis in compiler design. To understand how a lexical analyzer really works, I am trying to build one myself. I plan to build it in Java.

The input to the lexical analyzer is a .tex file of the following format:

\begin{document}

    \chapter{Introduction}

    \section{Scope}

    Arbitrary text.

    \section{Relevance}

    Arbitrary text.

    \subsection{Advantages}

    Arbitrary text.

    \subsubsection{In Real life}

    \subsection{Disadvantages}

    \end{document}

The output of the lexer should be a table of contents, possibly with page numbers, written to another file:

1. Introduction   1
  1.1 Scope   1
  1.2 Relevance   2
    1.2.1 Advantages   2
      1.2.1.1 In Real Life   2
    1.2.2 Disadvantages   3

I hope that this problem is within the scope of lexical analysis.

My lexer would read the .tex file, check for '\\', and on finding one, continue reading to check whether it is indeed one of the sectioning commands. A flag variable is set to indicate the type of sectioning. The word in curly braces following the sectioning command is then read and written out, prefixed with a number (like 1.2.1) that depends on the type and depth of the section.
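The numbering-by-depth part of that approach can be sketched with an array of counters, one per sectioning level: incrementing a level resets all deeper levels. This is a minimal illustration; the class and method names are my own, not from any existing library.

```java
import java.util.Arrays;

// Sketch of the depth-based numbering described above: counters[0] counts
// chapters, counters[1] sections, counters[2] subsections, and so on.
public class TocNumberer {
    private static final String[] LEVELS =
        {"chapter", "section", "subsection", "subsubsection"};
    private final int[] counters = new int[LEVELS.length];

    // Returns a numbered heading like "1.2.1 Advantages" for a sectioning
    // command, or null if the line is not a sectioning command.
    public String number(String line) {
        String trimmed = line.trim();
        for (int depth = 0; depth < LEVELS.length; depth++) {
            String prefix = "\\" + LEVELS[depth] + "{";
            if (trimmed.startsWith(prefix) && trimmed.endsWith("}")) {
                String title = trimmed.substring(prefix.length(), trimmed.length() - 1);
                counters[depth]++;
                Arrays.fill(counters, depth + 1, counters.length, 0); // reset deeper levels
                StringBuilder num = new StringBuilder();
                for (int d = 0; d <= depth; d++) {
                    if (d > 0) num.append('.');
                    num.append(counters[d]);
                }
                return num + " " + title;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        TocNumberer n = new TocNumberer();
        String[] input = {
            "\\chapter{Introduction}", "\\section{Scope}", "Arbitrary text.",
            "\\section{Relevance}", "\\subsection{Advantages}",
            "\\subsubsection{In Real life}", "\\subsection{Disadvantages}"
        };
        for (String line : input) {
            String entry = n.number(line);
            if (entry != null) System.out.println(entry);
        }
    }
}
```

Running this on the sample document above prints "1 Introduction", "1.1 Scope", "1.2 Relevance", "1.2.1 Advantages", "1.2.1.1 In Real life", "1.2.2 Disadvantages".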

I hope the above approach will work for building the lexer. How do I go about adding page numbers to the table of contents, if that is possible within the scope of the lexer?

You really could add them any way you want. I would recommend storing the contents of your .tex file in your own tree-like or map-like structure, then reading in your page-numbers file and applying the numbers appropriately.
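One way to realize the tree-like structure suggested here is a node per section holding a title, a page number, and a list of children; the dotted section numbers then fall out of each child's position in its parent's list. The class and field names below are illustrative, not from the original answer.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a tree-like structure for the table of contents. Each node holds
// a section title and a page number (to be filled in from the page-numbers
// file); children are the nested sections.
public class TocNode {
    final String title;
    int page;
    final List<TocNode> children = new ArrayList<>();

    TocNode(String title) { this.title = title; }

    // Walk the subtree, producing one ToC line per node with a dotted
    // section number and two spaces of indentation per level.
    List<String> render(String prefix, String indent) {
        List<String> out = new ArrayList<>();
        if (!prefix.isEmpty())
            out.add(indent + prefix + " " + title + "  " + page);
        for (int i = 0; i < children.size(); i++) {
            String childPrefix = prefix.isEmpty() ? String.valueOf(i + 1)
                                                  : prefix + "." + (i + 1);
            String childIndent = prefix.isEmpty() ? indent : indent + "  ";
            out.addAll(children.get(i).render(childPrefix, childIndent));
        }
        return out;
    }

    public static void main(String[] args) {
        TocNode root  = new TocNode("document");      // invisible root node
        TocNode intro = new TocNode("Introduction");  intro.page = 1;
        TocNode scope = new TocNode("Scope");         scope.page = 1;
        TocNode rel   = new TocNode("Relevance");     rel.page = 2;
        root.children.add(intro);
        intro.children.add(scope);
        intro.children.add(rel);
        for (String line : root.render("", "")) System.out.println(line);
    }
}
```

Because page numbers live in mutable fields on the nodes, the page-numbers file can be read in a separate pass and applied before rendering.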

A more archaic option would be to write a second parser that parses the output of your first parser together with the page-numbers file and appends the numbers appropriately.

It really is up to you. Since this is a learning exercise, try to build it as if someone else were going to use it. How user-friendly is it? Making something only you can use is still good for learning concepts, but it could lead to messy practices if you ever use it in the real world!

What you describe is really a lexer plus parser. The job of the lexical analyzer here is to return tokens and ignore whitespace. The tokens here are the various keywords introduced by '\\', string literals inside '{' and '}', and arbitrary text elsewhere. Everything else you described is parsing and tree-building.
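The token stream this answer describes could be sketched as follows. The token kinds and names are my own choices for illustration, and the sketch assumes well-formed input (every '{' has a matching '}').

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the lexer's job as described above: return tokens, ignore
// whitespace. COMMAND is a keyword introduced by a backslash, LITERAL is the
// text inside '{'...'}', and TEXT is arbitrary text elsewhere.
public class TexLexer {
    enum Kind { COMMAND, LITERAL, TEXT }

    record Token(Kind kind, String value) {}

    static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }   // ignore whitespace
            if (c == '\\') {                                    // keyword after backslash
                int j = i + 1;
                while (j < input.length() && Character.isLetter(input.charAt(j))) j++;
                tokens.add(new Token(Kind.COMMAND, input.substring(i + 1, j)));
                i = j;
            } else if (c == '{') {                              // literal up to matching '}'
                int j = input.indexOf('}', i);                  // assumes a '}' exists
                tokens.add(new Token(Kind.LITERAL, input.substring(i + 1, j)));
                i = j + 1;
            } else {                                            // arbitrary text
                int j = i;
                while (j < input.length()
                        && input.charAt(j) != '\\' && input.charAt(j) != '{') j++;
                tokens.add(new Token(Kind.TEXT, input.substring(i, j).trim()));
                i = j;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : lex("\\section{Scope} Arbitrary text."))
            System.out.println(t.kind() + " -> " + t.value());
    }
}
```

A parser consuming this stream would then do the tree-building: push a new node on each COMMAND/LITERAL pair and attach TEXT tokens to the current node.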
