How to extract words from a given string in Java

I am trying to extract all the words from a string, including words followed by parentheses (methods/functions in a programming language).

But I can only get the first word, not all of them. How can I iterate through all the words that match the given regex?

Here is what I tried. My String is the content of a text file I am reading, and it looks like this:

infile >> name; 

infile >> Id;
cout << name << " " << Id << endl;
hwp = compute_hw_participation (infile);
tests = compute_tests(tests, infile);
totalscore = compute_totalscore (totalscore, infile);

printRecord (name, Id, hwp, tests, totalscore, outfile);
infile >> name; 

return 0;
}

Additionally, I am trying to find the methods in this String. The methods are:

compute_hw_participation(infile)

compute_totalscore(totalscore, infile)

printRecord (name, Id, hwp, tests, totalscore, outfile) // this method has a space between the method name and the parenthesis. I need to capture the parentheses too (up to the closing parenthesis) despite the space. How can I achieve that?

This is what I have tried:

package com.codeingrams.recursion;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 *
 * @author Jananath Banuka
 */
public class Test {

    private static final Pattern p = Pattern.compile(" [^\\s(]+\\([^)]*\\)|\\S+");

    public static void main(String[] args) {
        String text = "\n"
                + "compute_hw_participation(infile) infile >> name; \n"
                + "while(!infile.eof())\n"
                + "{\n"
                + "infile >> Id;\n"
                + "cout << name << \" \" << Id << endl;\n"
                + "hwp = compute_hw_participation (infile);\n"
                + "tests = compute_tests(tests, infile);\n"
                + "totalscore = compute_totalscore (totalscore, infile);\n"
                + "// grade\n"
                + "printRecord (name, Id, hwp, tests, totalscore, outfile);\n"
                + "infile >> name; \n"
                + "}\n"
                + "\n"
                + "return 0;\n"
                + "}\n"
                + "";

        // create matcher for pattern p and given string
        Matcher m = p.matcher(text);        
        // if an occurrence of the pattern was found in the given string...
        if (m.find()) {
            // ...then you can use the group() methods.
            System.out.println(m.group(0)); // prints only the first match
            System.out.println(m.group(1)); // throws IndexOutOfBoundsException: No group 1
        }

    }
}

Output:

compute_hw_participation(infile)
Error: Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
    at java.util.regex.Matcher.group(Matcher.java:538)
    at com.codeingrams.recursion.Test.main(Test.java:44)

You need a pattern that matches function calls, i.e., a name, possibly whitespace, an opening parenthesis, some arguments, and a closing parenthesis.

Looking at the Javadoc for Pattern, you can see the character classes you can use in regular expressions. You'll need:

  • letters, digits, or underscores: \w (followed by + to match one or more of them)
  • maybe whitespace: \s* (the * means zero or more times)
  • an opening parenthesis, which you need to escape with a backslash because it has a special meaning in regular expressions: \(
  • some (or no) characters, until you find a closing parenthesis: [^)]* (the [ and ] create a character class, and the ^ negates it, so this matches any character except a closing parenthesis)
  • the actual closing parenthesis: \)

Then you need to add another backslash to each backslash, because Java string literals also use the backslash for special characters like \n.
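As a small illustration of the doubling rule, the sketch below writes the regex \w+ as a Java string literal and prints what the regex engine actually receives. The class name EscapeDemo and the constant WORD are made up for this example.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // The regex \w+ has to be written as "\\w+" in Java source code.
    static final String WORD = "\\w+";

    public static void main(String[] args) {
        // The string itself holds a single backslash.
        System.out.println(WORD);          // prints \w+
        System.out.println(WORD.length()); // prints 3
        // \w also matches the underscore, so the whole name matches.
        System.out.println(Pattern.matches(WORD, "compute_tests")); // prints true
    }
}
```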

You also need to add parentheses around the data you're interested in so that it is captured. That is also the reason you had to escape the literal parentheses in order to match them: unescaped parentheses mean grouping or capturing.

With that, the complete regular expression, written as a Java string literal, is "(\\w+\\s*\\([^)]*\\))".

Here's the full program:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String text = "\n"
                + "compute_hw_participation(infile) infile >> name; \n"
                + "while(!infile.eof())\n"
                + "{\n"
                + "infile >> Id;\n"
                + "cout << name << \" \" << Id << endl;\n"
                + "hwp = compute_hw_participation (infile);\n"
                + "tests = compute_tests(tests, infile);\n"
                + "totalscore = compute_totalscore (totalscore, infile);\n"
                + "// grade\n"
                + "printRecord (name, Id, hwp, tests, totalscore, outfile);\n"
                + "infile >> name; \n"
                + "}\n"
                + "\n"
                + "return 0;\n"
                + "}\n";
        Pattern p = Pattern.compile("(\\w+\\s*\\([^)]*\\))");
        Matcher m = p.matcher(text);
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}
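The question's call to m.group(1) failed because its pattern contained no capturing parentheses at all. As a variation on the program above (not part of the original answer), the following sketch uses two capturing groups so that the name and the argument list can be read separately. The names GroupDemo, CALL, and describeCalls are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Group 1 captures the name, group 2 the text between the parentheses.
    private static final Pattern CALL = Pattern.compile("(\\w+)\\s*\\(([^)]*)\\)");

    // Returns one "name -> arguments" entry per match found in the text.
    static List<String> describeCalls(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = CALL.matcher(text);
        while (m.find()) {
            result.add(m.group(1) + " -> " + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "hwp = compute_hw_participation (infile);\n"
                + "printRecord (name, Id, hwp, tests, totalscore, outfile);\n";
        // Prints:
        // compute_hw_participation -> infile
        // printRecord -> name, Id, hwp, tests, totalscore, outfile
        for (String line : describeCalls(text)) {
            System.out.println(line);
        }
    }
}
```

Here m.group(0), or simply m.group(), still returns the whole match, while group(1) and group(2) exist only because the pattern contains two pairs of unescaped parentheses.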

You'll see the limitations of this simplistic approach very quickly: it also treats while(!infile.eof() as a function, because it looks like a function call. The regex does not know about any language keywords. Note also that it does not catch the last closing parenthesis of the while expression: it does not count parentheses and simply stops at the first closing one. The regex also has no notion of comments or string literals, and will happily pick up commented-out code or strings like "foo()".

Because of that, you're almost always better off using a real parser for the language you are dealing with.
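If a full parser is not an option, two of the limitations above can be reduced with a little hand-written scanning. The sketch below is not a parser and not part of the original answer. It uses a regex only to find a name followed by an opening parenthesis, then counts parentheses to locate the matching close, and skips a small keyword list. The names CallScanner, START, KEYWORDS, and findCalls are hypothetical, the keyword list is deliberately incomplete, and comments and string literals are still not handled. It assumes Java 9 or later for Set.of.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CallScanner {
    // A name, optional whitespace, and an opening parenthesis.
    private static final Pattern START = Pattern.compile("(\\w+)\\s*\\(");
    // Illustrative, incomplete keyword list (an assumption of this sketch).
    private static final Set<String> KEYWORDS = Set.of("if", "while", "for", "switch", "return");

    static List<String> findCalls(String text) {
        List<String> calls = new ArrayList<>();
        Matcher m = START.matcher(text);
        int from = 0;
        while (m.find(from)) {
            // Walk forward from the opening parenthesis, tracking nesting depth.
            int depth = 1;
            int i = m.end();
            while (i < text.length() && depth > 0) {
                char c = text.charAt(i++);
                if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                }
            }
            // Keep the match only if it is balanced and the name is not a keyword.
            if (depth == 0 && !KEYWORDS.contains(m.group(1))) {
                calls.add(text.substring(m.start(), i));
            }
            // Resume right after the opening parenthesis so nested calls are found too.
            from = m.end();
        }
        return calls;
    }

    public static void main(String[] args) {
        String text = "while(!infile.eof())\n"
                + "tests = compute_tests(tests, infile);\n";
        // Prints:
        // eof()
        // compute_tests(tests, infile)
        for (String call : findCalls(text)) {
            System.out.println(call);
        }
    }
}
```

Note that infile.eof() is reported as eof(), because \w does not match the dot. That is one more detail a real parser would handle properly.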
