Java regex for identifiers (letters, digits and underscores)

Let's say you are given an input that could look like this: (identifier1 identifier_2 23 4).

I want to add a # symbol after every identifier. An identifier can contain letters, digits and underscores, but it must start with a letter. My approach was something like this:

input.replaceAll("[A-Za-z0-9_]+", "$0#");

However, this also puts a # symbol after every standalone number, which I wanted to exclude. The result should be (identifier1# identifier_2# 23 4). Is it possible to solve this problem with regex?

UPDATE 2

The Incremental Java says:

  • Each identifier must have at least one character.
  • The first character must be picked from: alpha, underscore, or dollar sign. The first character can not be a digit.
  • The rest of the characters (besides the first) can be from: alpha, digit, underscore, or dollar sign. In other words, it can be any valid identifier character.

    Put simply, an identifier is one or more characters selected from alpha, digit, underscore, or dollar sign. The only restriction is the first character can't be a digit.

So, you'd better use

String pattern = "(?:\\b[_a-zA-Z]|\\B\\$)[_$a-zA-Z0-9]*+";

See the regex demo.
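As a rough illustration of how the pattern above behaves with replaceAll, here is a minimal sketch. The class name DollarIdentifierDemo, the method tag, and the sample input are made up for this example; only the pattern itself comes from the answer.

```java
class DollarIdentifierDemo {
    // Pattern from the answer: an identifier starts with a letter or underscore
    // at a word boundary, or with '$' at a non-boundary position, and continues
    // with identifier characters (possessive quantifier, so no backtracking).
    static final String PATTERN = "(?:\\b[_a-zA-Z]|\\B\\$)[_$a-zA-Z0-9]*+";

    // Appends '#' after every match; $0 in the replacement is the whole match.
    static String tag(String input) {
        return input.replaceAll(PATTERN, "$0#");
    }

    public static void main(String[] args) {
        // prints (identifier1# _x# $y# 23 4)
        System.out.println(tag("(identifier1 _x $y 23 4)"));
    }
}
```

The separate \\B\\$ branch is there because $ is not a word character, so there is no word boundary between a space and a leading $.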

UPDATE

According to "Representing identifiers using Regular Expression", the identifier regex is [_a-zA-Z][_a-zA-Z0-9]*.

So, you may use

String pattern = "\\b[_a-zA-Z][_a-zA-Z0-9]*\\b";

NOTE that it also matches a run of underscores on its own, such as _______.

You can use

String p = "\\b_*[a-zA-Z][_a-zA-Z0-9]*\\b";

To avoid that. See the IDEONE demo.

String s = "(identifier1 identifier_2 23 4) ____ 33"; 
String p = "\\b_*[a-zA-Z][_a-zA-Z0-9]*\\b";
System.out.println(s.replaceAll(p, "$0#"));

Output: (identifier1# identifier_2# 23 4) ____ 33

OLD ANSWER

You can use the following pattern:

String p = "\\b(?!\\d+\\b)[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*\\b";

Or (if a _ can appear at the end):

String p = "\\b(?!\\d+\\b)[A-Za-z0-9]+(?:_[A-Za-z0-9]*)*\\b";

See the regex demo.

The pattern requires that the whole word (the expression is enclosed in word boundaries, \\b) is not a number (this is checked with (?!\\d+\\b)). The unrolled part [A-Za-z0-9]+(?:_[A-Za-z0-9]+)* matches a chunk of non-underscore word characters followed by zero or more sequences of an underscore followed by another such chunk.

IDEONE demo:

String s = "(identifier1 identifier_2 23 4) ____ 33"; 
String p = "\\b(?!\\d+\\b)[A-Za-z0-9]+(?:_[A-Za-z0-9]*)*\\b";
System.out.println(s.replaceAll(p, "$0#")); 

Output: (identifier1# identifier_2# 23 4) ____ 33
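To see what the negative lookahead in the old pattern excludes and what it still lets through, the matches can be listed one by one. This is only a sketch: the class name OldPatternDemo, the helper matches, and the sample input are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OldPatternDemo {
    // Second variant from the old answer (a trailing underscore is allowed).
    static final Pattern OLD =
        Pattern.compile("\\b(?!\\d+\\b)[A-Za-z0-9]+(?:_[A-Za-z0-9]*)*\\b");

    // Collects every substring the pattern matches, in order of appearance.
    static List<String> matches(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = OLD.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // "123" is rejected by the lookahead and "____" has no leading letter
        // or digit, but "1abc" is matched even though it starts with a digit.
        // prints [abc, 1abc, a_b_]
        System.out.println(matches("abc 123 1abc a_b_ ____"));
    }
}
```

Since a token such as 1abc is accepted, this pattern is looser than the "must start with a letter" rule in the question, which the patterns in the updates enforce.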

Your current regex says

one or more upper- or lower-case letters, digits, or underscores, in whatever order.

According to that regex, 54 is a valid identifier.

You actually wanted to write

a letter, followed by any number of letters, digits or underscores, in whatever order.

That would be written in code as:

input.replaceAll("[A-Za-z][A-Za-z0-9_]*", "$0#");

Wiktor notes that this regex will still match "identifiers" that sit inside a token that is not itself an identifier, such as the ab123 in 123ab123. To solve this, you could use the following variation:

input.replaceAll("\\b([A-Za-z][A-Za-z0-9_]*)\\b", "$1#");

This rejects 123ab123 as a valid identifier, but accepts ab123 in 123 ab123.
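The difference between the three regexes in this answer can be checked side by side. The following is a small sketch; the class name RegexComparison, the helper apply, and the sample input are invented for illustration.

```java
class RegexComparison {
    static final String ORIGINAL = "[A-Za-z0-9_]+";                  // from the question
    static final String LETTER_FIRST = "[A-Za-z][A-Za-z0-9_]*";      // must start with a letter
    static final String BOUNDED = "\\b([A-Za-z][A-Za-z0-9_]*)\\b";   // plus word boundaries

    // Thin wrapper so each regex/replacement pair can be tried on the same input.
    static String apply(String regex, String replacement, String input) {
        return input.replaceAll(regex, replacement);
    }

    public static void main(String[] args) {
        String input = "(identifier1 identifier_2 23 4) 123ab123";
        // prints (identifier1# identifier_2# 23# 4#) 123ab123#
        System.out.println(apply(ORIGINAL, "$0#", input));
        // prints (identifier1# identifier_2# 23 4) 123ab123#
        System.out.println(apply(LETTER_FIRST, "$0#", input));
        // prints (identifier1# identifier_2# 23 4) 123ab123
        System.out.println(apply(BOUNDED, "$1#", input));
    }
}
```

Only the bounded variant leaves 123ab123 untouched, because there is no word boundary between the 3 and the a.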

If you want to use Java to read Java, Java's got you covered: "\\b\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*\\b"
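A minimal sketch of that pattern applied to the input from the question. The class name JavaIdentifierDemo and the method tag are hypothetical names chosen for this example.

```java
class JavaIdentifierDemo {
    // \p{javaJavaIdentifierStart} and \p{javaJavaIdentifierPart} are backed by
    // Character.isJavaIdentifierStart and Character.isJavaIdentifierPart.
    static final String PATTERN =
        "\\b\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*\\b";

    // Appends '#' after every match of the pattern.
    static String tag(String input) {
        return input.replaceAll(PATTERN, "$0#");
    }

    public static void main(String[] args) {
        // prints (identifier1# identifier_2# 23 4)
        System.out.println(tag("(identifier1 identifier_2 23 4)"));
    }
}
```

Note that these character classes follow Java's own rules, so they accept more than the question asks for, for example an identifier that starts with an underscore.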
