Java regular expression to match valid Java identifiers

I need to create a regular expression that can find and extract valid identifiers in Java code like this:

int a, b, c;
float d, e;
a = b = 5;
c = 6;
if ( a > b)
{
    c = a - b;
    e = d - 2.0;
}
else
{
    d = e + 6.0;
    b = a + c;
}

I have tried to combine multiple regexes into a single regex, but how can I build a pattern that excludes reserved words?

I tried this regex: ^(((&&|<=|>=|<|>|,=|==|&|.)|([-+=]{1?2})|([;,.)}{;,(-]))|(else|if|float|int)|(\d[\d.])) but it does not work as expected.

Online demo

In the following picture, how should I match the identifiers?

[image]

A valid Java identifier:

  1. has at least one character
  2. MUST start with a letter [a-zA-Z], an underscore _, or a dollar sign $
  3. MAY continue with letters, digits, underscores, or dollar signs
  4. MUST NOT be a reserved word
  5. Update: MUST NOT be a single underscore _, which is a keyword since Java 9

A naive regexp covering the first three conditions is (\b([A-Za-z_$][$\w]*)\b), but it does not filter out the reserved words.
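To see why the naive pattern is not enough, here is a minimal sketch that runs it over a fragment of the code from the question. The class name NaiveIdentifierDemo and the helper find are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern itself comes from the answer.
class NaiveIdentifierDemo {

    // Naive pattern: a letter, underscore or dollar sign, followed by
    // any number of word characters or dollar signs.
    static final Pattern NAIVE = Pattern.compile("(\\b([A-Za-z_$][$\\w]*)\\b)");

    // Collects every match in order of appearance.
    static List<String> find(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = NAIVE.matcher(code);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Reserved words such as "if" and "else" are reported as well.
        System.out.println(find("if ( a > b) c = a - b; else e = d - 2.0;"));
        // prints [if, a, b, c, a, b, else, e, d]
    }
}
```

The numeric literal 2.0 is skipped because a match must start with a letter, underscore or dollar sign, but if and else still appear in the result.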

To exclude the reserved words, a negative lookahead (?!) is needed to specify a group of tokens that must not match: \b(?!(_\b|if|else|for|float|int))([A-Za-z_$][$\w]*) :

  • Group #1: (?!(_\b|if|else|for|float|int)) excludes the list of the specified words
  • Group #2: ([A-Za-z_$][$\w]*) matches identifiers.
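The sketch below applies the lookahead pattern with the short keyword list. It also shows one caveat that is easy to miss: as written, the lookahead rejects any token that merely starts with a listed word, such as integer or format. The WHOLE_WORD variant is an assumption of this example, not part of the original answer; it adds (?![$\w]) after the alternation so that only complete reserved words are rejected. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the negative lookahead.
class KeywordLookaheadDemo {

    // Pattern as given in the text (short keyword list).
    static final Pattern AS_WRITTEN = Pattern.compile(
            "\\b(?!(_\\b|if|else|for|float|int))([A-Za-z_$][$\\w]*)");

    // Assumed variant: a listed word is rejected only when it is not
    // followed by another identifier character.
    static final Pattern WHOLE_WORD = Pattern.compile(
            "\\b(?!(?:_|if|else|for|float|int)(?![$\\w]))([A-Za-z_$][$\\w]*)");

    // Collects every match of the given pattern in order of appearance.
    static List<String> find(Pattern pattern, String code) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(code);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(find(AS_WRITTEN, "if (a > b) c = a;"));    // prints [a, b, c, a]
        System.out.println(find(AS_WRITTEN, "int integer = format;")); // prints []
        System.out.println(find(WHOLE_WORD, "int integer = format;")); // prints [integer, format]
    }
}
```

Whether the stricter variant is wanted depends on the input; for the sample code in the question both patterns give the same result.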

However, the dollar sign $ is not a word character, so the word boundary \b does not match before a leading $ and this regular expression fails to match identifiers starting with $ .
Also, we may want to exclude matches inside string and character literals ("not_a_variable", 'c', '\u65').

This can be done by replacing the word boundary \b with a positive lookbehind (?<=), which requires a given group of characters before the main expression without including it in the result: (?<=[^$\w'"\\])(?!(_\b|if|else|for|float|int))([A-Za-z_$][$\w]*)

Online demo for a short list of reserved words
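A small comparison of the two patterns, using the short keyword list, may make the difference concrete. The class name BoundaryVsLookbehindDemo and the helper find are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the word boundary with the lookbehind.
class BoundaryVsLookbehindDemo {

    // Version anchored with a word boundary.
    static final Pattern WORD_BOUNDARY = Pattern.compile(
            "\\b(?!(_\\b|if|else|for|float|int))([A-Za-z_$][$\\w]*)");

    // Version anchored with a positive lookbehind: the previous character
    // must not be an identifier character, a quote, or a backslash.
    static final Pattern LOOKBEHIND = Pattern.compile(
            "(?<=[^$\\w'\"\\\\])(?!(_\\b|if|else|for|float|int))([A-Za-z_$][$\\w]*)");

    // Collects every match of the given pattern in order of appearance.
    static List<String> find(Pattern pattern, String code) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(code);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(find(WORD_BOUNDARY, "int $1;"));        // prints []
        System.out.println(find(LOOKBEHIND, "int $1;"));           // prints [$1]
        System.out.println(find(WORD_BOUNDARY, "int _c0 = 'c';")); // prints [_c0, c]
        System.out.println(find(LOOKBEHIND, "int _c0 = 'c';"));    // prints [_c0]
    }
}
```

Two limits of the lookbehind are worth keeping in mind. It inspects only the single character before a candidate, so it protects only the first word after an opening quote, and because it needs a preceding character it cannot match a token at the very start of the input.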

Next, the full list of the Java reserved words can be collected into a single String of tokens separated with | .

A test class showing the final regular expression pattern and its use to detect Java identifiers is provided below.

import java.util.Arrays;
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class IdFinder {

    // Reserved words plus the boolean and null literals.
    // The last entry rejects a lone underscore (a keyword since Java 9).
    static final List<String> RESERVED = Arrays.asList(
        "abstract", "assert", "boolean", "break", "byte", "case", "catch", "char", "class", "const",
        "continue", "default", "double", "do", "else", "enum", "extends", "false", "final", "finally",
        "float", "for", "goto", "if", "implements", "import", "instanceof", "int", "interface", "long",
        "native", "new", "null", "package", "private", "protected", "public", "return", "short", "static",
        "strictfp", "super", "switch", "synchronized", "this", "throw", "throws", "transient", "true", "try",
        "void", "volatile", "while", "_\\b"
    );

    // Alternation of all reserved words, used inside the negative lookahead.
    static final String JAVA_KEYWORDS = String.join("|", RESERVED);

    // Lookbehind: the previous character is not an identifier character, a quote, or a backslash.
    // Lookahead: the text at this position does not start with a reserved word.
    static final Pattern VALID_IDENTIFIERS = Pattern.compile(
            "(?<=[^$\\w'\"\\\\])(?!(" + JAVA_KEYWORDS + "))([A-Za-z_$][$\\w]*)");

    public static void main(String[] args) {
        System.out.println("ID pattern:\n" + VALID_IDENTIFIERS.pattern());

        String code = "public class Main {\n\tstatic int $1;\n\tprotected char _c0 = '\\u65';\n\tprivate long c1__$$;\n}";

        System.out.println("\nIdentifiers in the following code:\n=====\n" + code + "\n=====");

        // Matcher.results() requires Java 9 or later.
        VALID_IDENTIFIERS.matcher(code).results()
                         .map(MatchResult::group)
                         .forEach(System.out::println);
    }
}

Output

ID pattern:
(?<=[^$\w'"\\])(?!(abstract|assert|boolean|break|byte|case|catch|char|class|const|continue|default|double|do|else|enum|extends|false|final|finally|float|for|goto|if|implements|import|instanceof|int|interface|long|native|new|null|package|private|protected|public|return|short|static|strictfp|super|switch|synchronized|this|throw|throws|transient|true|try|void|volatile|while|_\b))([A-Za-z_$][$\w]*)

Identifiers in the following code:
=====
public class Main {
    static int $1;
    protected char _c0 = '\u65';
    private long c1__$$;
}
=====
Main
$1
_c0
c1__$$
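As an alternative to folding the keyword list into the regex, the candidates can be matched first and filtered afterwards. The sketch below is not part of the original answer. It assumes the standard java.compiler module is available and uses javax.lang.model.SourceVersion.isKeyword, which also treats true, false and null as reserved. It swaps the positive lookbehind for a negative one, (?<!...), so that a token at the very start of the input can be matched. The class name FilteredIdFinder is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.lang.model.SourceVersion;

// Hypothetical two-step finder: match candidates, then drop reserved words.
class FilteredIdFinder {

    // Candidate tokens: same character rules as in the answer, guarded by a
    // negative lookbehind instead of a positive one.
    static final Pattern CANDIDATE =
            Pattern.compile("(?<![$\\w'\"\\\\])[A-Za-z_$][$\\w]*");

    // Returns the candidates that are not keywords or literals.
    static List<String> find(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(code);
        while (m.find()) {
            String token = m.group();
            if (!SourceVersion.isKeyword(token)) {
                found.add(token);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String code = "public class Main {\n\tstatic int $1;\n\tprotected char _c0 = '\\u65';\n\tprivate long c1__$$;\n}";
        find(code).forEach(System.out::println);
    }
}
```

Because whole tokens are compared against the keyword list, names such as integer or format are kept, and the list of reserved words follows the JDK the code runs on.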
