Using Regex in Lexical Analyzer (Java)

I am building a lexical analyzer for a fictional XML-style language, and I'm currently trying to turn the following lexical specification into Java code:

Name -> Initial Other*
Initial -> Letter | _ | :
Other -> Initial | Digit | - | .
String -> " (Char | ')* " | '(Char | ")* '
Data -> Char+
Char -> Ordinary | Special | Reference
Ordinary -> NOT (< | > | " | ' | &) 
Special -> &lt; | &gt; | &quot; | &apos; | &amp;
Reference -> &#(Digit)+; | &#x(Digit|a...f|A...F)+;
Letter -> a...z | A...Z
Digit -> 0...9

I'm no expert, but I do know I have to use regular expressions for these. So my Tokenizer now looks like this:

public Tokenizer(String str) {
    this.tokenContents = new ArrayList<TokenContent>();
    this.str = str;

    // Name = Initial Other*
    String initial = "[a-zA-Z] | _ | :";
    String other = initial + " | [0-9] | - | \\.";
    String name = initial + "(" + other + ")*";
    tokenContents.add(new TokenContent(Pattern.compile(name), TokenType.NAME));
    // String = " " (Char | ')* " | ' (Char | ")* '
    String ordinary = "(?!(< | > | \" | ' | &))";
    String special = "&lt; | &gt; | &quot; | &apos; | &amp;";
    String reference = "&#[0-9]+; | &#x([0-9] | [a-fA-F])+;";
    String character = ordinary + " | " + special + " | " + reference;
    String string = "\"(" + character + " | " + "')* \" | ' (\"" + character + " | " + "\")* '";
    tokenContents.add(new TokenContent(Pattern.compile(string), TokenType.STRING));
    // Data = Char+
    String data = character + "+"; 
    tokenContents.add(new TokenContent(Pattern.compile(data), TokenType.DATA)); 
    // The symbol <
    tokenContents.add(new TokenContent(Pattern.compile("<"), TokenType.LEFT_TAG));
    // The symbol >
    tokenContents.add(new TokenContent(Pattern.compile(">"), TokenType.RIGHT_TAG));
    // The symbol </
    tokenContents.add(new TokenContent(Pattern.compile("</"), TokenType.LEFT_TAG_SLASH));
    // The symbol />
    tokenContents.add(new TokenContent(Pattern.compile("/>"), TokenType.RIGHT_TAG_SLASH));  
    // The symbol = 
    tokenContents.add(new TokenContent(Pattern.compile("="), TokenType.EQUALS));    
}

For simplicity, I have modularized my regular expressions according to the specification above. However, when I run the lexer on an example input file, several test cases produce parsing errors. I believe the cause might be my regular expressions, so I would like some suggestions on how to correctly translate the above specification into code and fix my Tokenizer.

My tokens are Name, String, Data, <, >, </, />, and =. They are all specified in an enum class that isn't displayed here. An example input file is:

<recipe name="bread" prep_time="5 mins" cook_time="3 hours">
   <title>Basic bread</title>
   <ingredient amount="3" unit="cups">Flour</ingredient>
   <ingredient amount="0.25" unit="ounce">Yeast</ingredient>
   <ingredient amount="1.5" unit="cups" state="warm">Water</ingredient>
   <ingredient amount="1" unit="teaspoon">Salt</ingredient>
   <instructions>
     <step>Mix all ingredients together.</step>
     <step>Knead thoroughly.</step>
     <step>Cover with a cloth, and leave for one hour in warm room.</step>
     <step>Knead again.</step>
     <step>Place in a bread baking tin.</step>
     <step>Cover with a cloth, and leave for one hour in warm room.</step>
     <step>Bake in the oven at 350&#x00B0; F for 30 minutes.</step>
   </instructions>
</recipe>

I haven't worked much with regular expressions before, so this is a first for me. I would really appreciate any input that could help.

String ordinary = "(?!(< | > | \" | ' | &))";

This pattern won't do what you want it to. Lookahead is a feature that makes a pattern match only if it's followed (or, in the case of negative lookahead as you use here, not followed) by a specific pattern. The lookahead itself does not consume any input.

Take for example the pattern [a-z]+(?=\\s) . This will match a sequence of letters that is followed by whitespace, but not the whitespace itself. So the pattern would match the "abc" in "abc def" and would not match anything in "abc_def". But either way the match would not include the space. If you use this in a tokenizer (that also has a rule for whitespace), this will cause "abc def " to be tokenized as "abc", " ", "def", " ", rather than "abc ", "def ". So that's useful.
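To see this behavior directly, here is a minimal sketch using java.util.regex. The class name LookaheadDemo and the helper findAll are made up for illustration and are not part of your Tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original Tokenizer.
class LookaheadDemo {
    // Letters that are followed by whitespace. The lookahead checks
    // the whitespace but does not include it in the match.
    static final Pattern WORD_BEFORE_SPACE = Pattern.compile("[a-z]+(?=\\s)");

    // Collects every match of the pattern in the input.
    static List<String> findAll(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = WORD_BEFORE_SPACE.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("abc def"));  // prints [abc]
        System.out.println(findAll("abc_def"));  // prints []
    }
}
```

The matched text never ends with a space, which is why a separate whitespace rule still gets its turn.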

But in your case your entire pattern is a lookahead. So if you tokenized something using your rule, the result would look more like "", "", ... ad infinitum. That's less useful.
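A small sketch of why that happens, assuming a lookahead-only pattern written as a character class without the stray spaces. The names EmptyLookaheadDemo and matchedLength are hypothetical. The match succeeds but has length zero, so a tokenizer that advances by the match length would never move forward.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original Tokenizer.
class EmptyLookaheadDemo {
    // A pattern that consists of nothing but a negative lookahead.
    static final Pattern ONLY_LOOKAHEAD = Pattern.compile("(?![<>\"'&])");

    // Returns the length of the match at the start of the input,
    // or -1 if the pattern does not match there.
    static int matchedLength(String input) {
        Matcher m = ONLY_LOOKAHEAD.matcher(input);
        return m.lookingAt() ? m.end() - m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(matchedLength("abc"));   // prints 0
        System.out.println(matchedLength("<abc"));  // prints -1
    }
}
```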

What you want is a negated character class, which is created using [^...] where the ... is a list of characters or character ranges as you'd use with a normal character class. It matches exactly one character as long as that character is not in the specified list. Using this, your regex would look like this:

String ordinary = "[^<>\"'&]";
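Going one step further, here is a rough sketch of how the rest of the specification might be written in the same style. It is one possible translation, not a drop-in fix: the class SpecPatterns and its helper methods are invented for illustration. Note that it puts no spaces around | because Java treats whitespace inside a pattern as literal characters unless Pattern.COMMENTS is set, so "a | b" matches "a " or " b".

```java
import java.util.regex.Pattern;

// Hypothetical helper class; one possible reading of the spec.
class SpecPatterns {
    // Name -> Initial Other*
    static final String INITIAL = "[a-zA-Z_:]";
    static final String OTHER = "[a-zA-Z_:0-9.-]";
    static final String NAME = INITIAL + OTHER + "*";

    // Char -> Ordinary | Special | Reference
    static final String ORDINARY = "[^<>\"'&]";
    static final String SPECIAL = "&(?:lt|gt|quot|apos|amp);";
    static final String REFERENCE = "&#[0-9]+;|&#x[0-9a-fA-F]+;";
    static final String CHAR =
        "(?:" + ORDINARY + "|" + SPECIAL + "|" + REFERENCE + ")";

    // String -> " (Char | ')* " | ' (Char | ")* '
    static final String STRING =
        "\"(?:" + CHAR + "|')*\"|'(?:" + CHAR + "|\")*'";

    // Data -> Char+
    static final String DATA = CHAR + "+";

    // Each helper checks whether the whole input matches the rule.
    static boolean isName(String s)   { return Pattern.matches(NAME, s); }
    static boolean isString(String s) { return Pattern.matches(STRING, s); }
    static boolean isData(String s)   { return Pattern.matches(DATA, s); }

    public static void main(String[] args) {
        System.out.println(isName("prep_time"));      // prints true
        System.out.println(isString("\"5 mins\""));   // prints true
        System.out.println(isData("350&#x00B0; F"));  // prints true
        System.out.println(isData("a&b"));            // prints false
    }
}
```

The non-capturing groups (?:...) keep the alternations from leaking into the surrounding pattern when the pieces are concatenated.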
