How to extract words from a given string in Java
I am trying to extract all the words from a string, including words followed by parentheses (method/function calls in a programming language). But I can only get the first word, not all of them. How can I iterate through all the words that match the given regex?
Here is what I tried. My String is the content of a text file I am reading, and it looks like this:
infile >> name;
infile >> Id;
cout << name << " " << Id << endl;
hwp = compute_hw_participation (infile);
tests = compute_tests(tests, infile);
totalscore = compute_totalscore (totalscore, infile);
printRecord (name, Id, hwp, tests, totalscore, outfile);
infile >> name;
return 0;
}
Additionally, I am trying to find the methods in this String. The methods are:
compute_hw_participation(infile)
compute_totalscore(totalscore, infile)
printRecord (name, Id, hwp, tests, totalscore, outfile) // this method has a space between the method name and the opening parenthesis; I need to capture everything up to the closing parenthesis despite the space. How can I achieve that too?
This is what I have tried:
package com.codeingrams.recursion;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
*
* @author Jananath Banuka
*/
public class Test {
private static final Pattern p = Pattern.compile(" [^\\s(]+\\([^)]*\\)|\\S+");
public static void main(String[] args) {
String text = "\n"
+ "compute_hw_participation(infile) infile >> name; \n"
+ "while(!infile.eof())\n"
+ "{\n"
+ "infile >> Id;\n"
+ "cout << name << \" \" << Id << endl;\n"
+ "hwp = compute_hw_participation (infile);\n"
+ "tests = compute_tests(tests, infile);\n"
+ "totalscore = compute_totalscore (totalscore, infile);\n"
+ "// grade\n"
+ "printRecord (name, Id, hwp, tests, totalscore, outfile);\n"
+ "infile >> name; \n"
+ "}\n"
+ "\n"
+ "return 0;\n"
+ "}\n"
+ "";
// create matcher for pattern p and given string
Matcher m = p.matcher(text);
// if an occurrence of the pattern was found in the given string...
if (m.find()) {
// ...then you can use group() methods.
System.out.println(m.group(0)); // gives only infile
System.out.println(m.group(1)); //this gives error arrayIndexoutofBound
}
}
}
Output:
compute_hw_participation(infile)
Error: Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
at java.util.regex.Matcher.group(Matcher.java:538)
at com.codeingrams.recursion.Test.main(Test.java:44)
You need a pattern that matches function calls, i.e., a name, possibly whitespace, an opening parenthesis, some arguments, and a closing parenthesis.
Looking at the Javadoc for Pattern, you can see the constructs you can use in regular expressions. You'll need:
\w+ for the name: \w is a word character (letter, digit, or underscore), and + means one or more times
\s* for optional whitespace: the * means zero or more times
\( for a literal opening parenthesis
[^)]* for the arguments: [ and ] create a character class, and the ^ is negation, meaning any character except those listed in the class
\) for a literal closing parenthesis
Then you need to add another backslash to each backslash when you write the regex in Java source, because Java string literals also use the backslash for escape sequences such as \n.
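To make the doubling concrete, here is a minimal sketch. The class name EscapeDemo and the constant JAVA_LITERAL are made up for illustration; only Pattern.compile, Pattern.pattern and Matcher.matches come from java.util.regex. It compiles a doubled-backslash literal and prints the regex text the engine actually receives.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // The regex engine should see:  \w+\s*\(
    // In Java source every backslash is doubled, because the compiler
    // consumes one level of escaping before the regex engine runs.
    static final String JAVA_LITERAL = "\\w+\\s*\\(";

    public static void main(String[] args) {
        Pattern p = Pattern.compile(JAVA_LITERAL);
        // pattern() returns the regex text as the engine received it
        System.out.println(p.pattern());                          // prints \w+\s*\(
        // The name, the space and the opening parenthesis all match
        System.out.println(p.matcher("printRecord (").matches()); // prints true
    }
}
```

The string held in JAVA_LITERAL is only eight characters long; the extra backslashes exist in the source file, not in the compiled pattern.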
You also need to add parentheses around the data you're interested in, so that it is captured. That is also the reason you had to escape the parentheses in order to match them literally: unescaped parentheses mean grouping or capturing.
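This also explains the "No group 1" exception in the question: that pattern contains only escaped parentheses, so it defines no capturing groups, and only group(0) (the whole match) is available. The sketch below illustrates the difference. The class GroupDemo and the helper firstCallName are hypothetical names, and splitting the name and the arguments into two groups is my own variation, not the regex used in this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Hypothetical helper: name of the first call found in text, or null if none
    static String firstCallName(String text) {
        // Group 1 captures the name, group 2 the argument list
        Matcher m = Pattern.compile("(\\w+)\\s*\\(([^)]*)\\)").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "hwp = compute_hw_participation (infile);";

        // No unescaped parentheses: zero capturing groups, only group(0) exists
        Matcher plain = Pattern.compile("\\w+\\s*\\([^)]*\\)").matcher(line);
        System.out.println(plain.groupCount());   // prints 0
        if (plain.find()) {
            System.out.println(plain.group(0));   // prints compute_hw_participation (infile)
        }

        // Two capturing groups: group(1) and group(2) are now valid
        Matcher grouped = Pattern.compile("(\\w+)\\s*\\(([^)]*)\\)").matcher(line);
        while (grouped.find()) {                  // while, not if, visits every match
            System.out.println(grouped.group(1)); // prints compute_hw_participation
            System.out.println(grouped.group(2)); // prints infile
        }
    }
}
```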
With that, the complete regular expression, written as a Java string literal, is (\\w+\\s*\\([^)]*\\)).
Here's the full program:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String[] args) {
String text = "\n"
+ "compute_hw_participation(infile) infile >> name; \n"
+ "while(!infile.eof())\n"
+ "{\n"
+ "infile >> Id;\n"
+ "cout << name << \" \" << Id << endl;\n"
+ "hwp = compute_hw_participation (infile);\n"
+ "tests = compute_tests(tests, infile);\n"
+ "totalscore = compute_totalscore (totalscore, infile);\n"
+ "// grade\n"
+ "printRecord (name, Id, hwp, tests, totalscore, outfile);\n"
+ "infile >> name; \n"
+ "}\n"
+ "\n"
+ "return 0;\n"
+ "}\n";
Pattern p = Pattern.compile("(\\w+\\s*\\([^)]*\\))");
Matcher m = p.matcher(text);
while (m.find()) {
System.out.println(m.group());
}
}
}
You'll see the limitations of this simplistic approach very quickly: it also treats while(!infile.eof() as a function call, because it looks like one. The regex knows nothing about language keywords. Note also that it does not capture the last closing parenthesis of the while expression: it does not count parentheses and simply stops at the first closing one. The regex also has no notion of comments or strings, and would happily pick up commented-out code or string contents like "foo()".
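Two of these limitations can be softened without a full parser. The following is only a rough sketch under stated assumptions: the class CallScanner, the method findCalls and the tiny KEYWORDS list are all invented for illustration. It uses the regex only to find a name plus opening parenthesis, then counts parentheses by hand to find the matching close, and skips names that are keywords. It still knows nothing about comments or string literals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CallScanner {
    // Assumption: a deliberately incomplete keyword list, for illustration only
    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("if", "while", "for", "switch", "return"));

    // A name, optional whitespace, and the opening parenthesis
    private static final Pattern CALL_START = Pattern.compile("(\\w+)\\s*\\(");

    static List<String> findCalls(String text) {
        List<String> calls = new ArrayList<>();
        Matcher m = CALL_START.matcher(text);
        int from = 0;
        while (m.find(from)) {
            // Count parentheses from just after the opening one until balanced
            int depth = 1;
            int i = m.end();
            while (i < text.length() && depth > 0) {
                char c = text.charAt(i++);
                if (c == '(') depth++;
                else if (c == ')') depth--;
            }
            // Keep only balanced calls whose name is not a keyword
            if (depth == 0 && !KEYWORDS.contains(m.group(1))) {
                calls.add(text.substring(m.start(), i));
            }
            // Resume right after the opening parenthesis so nested calls are found too
            from = m.end();
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(findCalls("while(!infile.eof())"));     // prints [eof()]
        System.out.println(findCalls("printRecord (name, Id);"));  // prints [printRecord (name, Id)]
    }
}
```

Resuming the search just after each opening parenthesis means nested calls are reported as separate entries, which may or may not be what you want. Even so, this is still text matching rather than parsing.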
Because of that, you're almost always better off using a real parser for the language you are dealing with.