
Returning Java regex (words, spaces, special characters, double quotes)

I am trying to use a Java regex to tokenize a source file in any language. What I want the list to return is:

  • words ( [a-z_A-Z0-9] )
  • spaces
  • any of [()*.,+-/=&:] as a single character
  • and quoted items left in quotes.

Here is the code I have so far:

Pattern pattern = Pattern.compile("[\"(\\w)\"]+|[\\s\\(\\)\\*\\+\\.,-/=&:]");

Matcher matcher = pattern.matcher(str);
List<String> matchlist = new ArrayList<String>();

while(matcher.find()) {
    matchlist.add(matcher.group(0));
}

For example,

"I" am_the 2nd "best".

returns: list, size 8

("I", ,am_the, ,2nd, ,"best", .)

which is what I want. However, if the whole sentence is quoted, except for the period:

"I am_the 2nd best".

returns: list, size 8

("I, ,am_the, ,2nd, ,best", .)

and I want it to be able to return: list, size 2

("I am_the 2nd best", .)

If that makes sense. I believe it works for everything I want except for returning string literals (for which I want to keep the quotes). What am I missing from the pattern that would allow me to achieve this?

And by all means, if there is an easier pattern to use that I do not see, please help me out. The pattern shown above is the result of a lot of trial and error. Thank you very much in advance for any help.

First, you'll need to separate the word-matching code from the string-literal-matching code. For word matching, use:

\w+

Next there's whitespace.

\s+

To match strings as one token, you need to allow more characters than just \w. That only allows alphanumeric characters and _, which means whitespace and symbols are not allowed. You also need to move the starting and ending quotes outside of the square brackets.

And don't forget backslashes to escape characters. You want to allow \" inside of strings.

"(\\.|[^"])+"

Finally, there are the symbols. You could list all the symbols, or you could just treat any non-word, non-whitespace, non-quote character as a symbol. I recommend the latter so you don't choke on other symbols like @ or |. So for symbols:

[^\s\w"]

Putting the pieces together, we get this combined regex:

\w+|\s+|"(\\.|[^"])+"|[^\s\w"]

Or, escaping everything properly so it can be put into source code:

Pattern pattern = Pattern.compile("\\w+|\\s+|\"(\\\\.|[^\"])+\"|[^\\s\\w\"]");
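To check the combined pattern against the two inputs from the question, a minimal sketch along these lines could be used. The class name `TokenizerSketch` and the method `tokenize` are made up for illustration; only the pattern itself comes from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; the name is an assumption, not from the answer.
public class TokenizerSketch {
    // words | whitespace runs | quoted string literals | any other single symbol
    static final Pattern TOKEN =
        Pattern.compile("\\w+|\\s+|\"(\\\\.|[^\"])+\"|[^\\s\\w\"]");

    // Collects every match of TOKEN, in order, into a list.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "I" am_the 2nd "best".       -> 8 tokens
        System.out.println(tokenize("\"I\" am_the 2nd \"best\".").size());
        // "I am_the 2nd best".         -> 2 tokens
        System.out.println(tokenize("\"I am_the 2nd best\".").size());
        // say "a \"b\" c"              -> 3 tokens, escaped quotes stay inside the literal
        System.out.println(tokenize("say \"a \\\"b\\\" c\"").size());
    }
}
```

One thing to be aware of: a lone double quote with no closing partner matches none of the four alternatives, so `find()` silently skips it. The same applies to an empty literal `""`, because the pattern requires at least one character between the quotes. Whether that matters depends on the input you expect.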

Typically, when parsing text, the process you're describing is called "lexical analysis", and the function used is called a 'lexer', which breaks up an input stream into identifiable tokens like words, numbers, spaces, periods, etc.

The output of a lexer is consumed by a 'parser', which does "syntactic analysis" by identifying groups of tokens that belong together, like [double-quote] [word] [double-quote].

I would recommend you follow the same two-pass strategy, since it's been proven time and again in many, many parsers.

So, your first step might be to use this regular expression as your lexer:

\W|\w+

which will break your input text into either single non-word characters (like spaces, double and single quotation marks, commas, periods, etc.) or sequences of one or more word characters, where \w is really just a shortcut for [a-zA-Z_0-9].

So, using your example above:

// Needs java.util.ArrayList, java.util.List,
// java.util.regex.Matcher and java.util.regex.Pattern.
String str = "\"I\" am_the 2nd \"best\".";

String p = "\\W|\\w+";

Pattern pattern = Pattern.compile(p);
Matcher matcher = pattern.matcher(str);
List<String> matchlist = new ArrayList<String>();

while (matcher.find()) {
    matchlist.add(matcher.group(0));
}

produces:

['"', 'I', '"', ' ', 'am_the', ' ', '2nd', ' ', '"', 'best', '"', '.']

which you can then decide how to treat in your code.
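As a rough illustration of what the second pass could look like, here is a sketch that glues every token between a pair of double quotes back into one string-literal token. The class name `TwoPassSketch` and the methods `lex` and `groupQuoted` are hypothetical, and the grouping rule is an assumption about what you want: it does not handle escaped quotes, and it keeps an unterminated string as a single trailing token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical two-pass example; names are assumptions, not from the answer.
public class TwoPassSketch {
    // Pass 1 (lexer): single non-word characters or runs of word characters.
    static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\W|\\w+").matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Pass 2 (grouping): merge everything between two double-quote tokens,
    // quotes included, into one token. Escaped quotes are NOT handled here.
    static List<String> groupQuoted(List<String> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder literal = null; // non-null while inside a quoted string
        for (String t : tokens) {
            if (literal == null) {
                if (t.equals("\"")) {
                    literal = new StringBuilder(t); // opening quote
                } else {
                    out.add(t);
                }
            } else {
                literal.append(t);
                if (t.equals("\"")) { // closing quote
                    out.add(literal.toString());
                    literal = null;
                }
            }
        }
        if (literal != null) {
            out.add(literal.toString()); // unterminated string: keep what we have
        }
        return out;
    }

    public static void main(String[] args) {
        // "I am_the 2nd best".  -> 2 tokens after grouping
        System.out.println(groupQuoted(lex("\"I am_the 2nd best\".")).size());
        // "I" am_the 2nd "best".  -> 8 tokens after grouping
        System.out.println(groupQuoted(lex("\"I\" am_the 2nd \"best\".")).size());
    }
}
```

Keeping the grouping logic in ordinary code rather than in the regex makes it easier to add rules later (escapes, single-quoted strings, comments) without rewriting the pattern.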

No, this doesn't give you a single one-size-fits-all regular expression that matches both of the cases you list above. But in my experience, regular expressions aren't really the best tool for the kind of syntactic analysis you require: either they lack the expressiveness needed to cover all possible cases or, far more likely, they quickly become too complex for anyone but a true regex expert to fully comprehend.
