
My regular expression captures trailing whitespace after words, but it shouldn't

I have a regular expression that is supposed to capture uppercase words: if there is a run of one or more words that are all uppercase, the regex should find it. I also have another regular expression that captures a single all-uppercase word. For some reason the first regex captures a single all-uppercase word with trailing whitespace at the end. Here is my code.

    //This looks for All Cap Words inside parens-completed
    String ucParensRegEx = "\\([A-Z]+\\)";
    regexParser(we, ucParensRegEx);
    //This looks for All Upper case words with two or more letters.- completed
    String twoPlusUCRegEx = "[A-Z][A-Z]+";
    regexParser(we, twoPlusUCRegEx);

    String letNumRegEx = "[A-Z][A-Z0-9][A-Z]+";
    regexParser(we, letNumRegEx);

    //Looks for Uppercase words that start with a number-Completed
    String numLetRegEx = "[0-9][A-Z][A-Z]+";
    regexParser(we, numLetRegEx);

    String upperwhitespaceRegEx = "(\\b[A-Z'][A-Z]+\\b\\s*)+";
    regexParser(we, upperwhitespaceRegEx);

    private void regexParser(WordExtractor we, String regex) {
        if (we.getParagraphText() != null) {
            String[] dataArray = we.getParagraphText();

            for (int i = 0; i < dataArray.length; i++) {
                String data = dataArray[i].toString();
                Pattern p = Pattern.compile(regex);
                Matcher m = p.matcher(data);
                while (m.find()) {
                    if (!sequences.contains(data.substring(m.start(), m.end())) && !data.equals("US ") && !data.contains("ARABIC") && !data.contains("ALATEC") && !data.contains("HYPERLINK")) {
                        sequences.add(data.substring(m.start(), m.end()));
                        System.out.println(data.substring(m.start(), m.end()));
                        Acronym acc = new Acronym(data.substring(m.start(), m.end()), data, "", false);
                        newAcList.add(acc);
                    }
                }
            }
        }
    }
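
In upperwhitespaceRegEx the \\s* sits inside the repeated group, so after the last uppercase word it still matches the whitespace that follows. Match whitespace only between words instead:

    "\\b[A-Z'][A-Z]+(\\s+[A-Z'][A-Z]+)*\\b"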

The interior word boundaries are unnecessary, because \\s[A-Z] by definition has a word boundary between the whitespace and the uppercase letter. So all you need to do is match an uppercase word, then optionally match any number of further uppercase words after it, each preceded by whitespace.
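
To see the effect in isolation, here is a minimal sketch that applies the corrected pattern to a plain string. The class name UpperCaseRuns, the method findRuns, and the sample sentence are made up for illustration and are not part of the original code; the sketch also leaves out WordExtractor and the duplicate filtering from regexParser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not from the original question.
public class UpperCaseRuns {
    // Whitespace is only matched between two uppercase words,
    // so a match can never end in whitespace.
    private static final Pattern RUN =
            Pattern.compile("\\b[A-Z'][A-Z]+(\\s+[A-Z'][A-Z]+)*\\b");

    // Returns every run of one or more uppercase words found in text.
    static List<String> findRuns(String text) {
        List<String> runs = new ArrayList<>();
        Matcher m = RUN.matcher(text);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        // Brackets make any stray leading or trailing whitespace visible.
        for (String run : findRuns("The US and the UNITED NATIONS met.")) {
            System.out.println("[" + run + "]");
        }
    }
}
```

With the sample sentence this should print [US] and [UNITED NATIONS], each without a trailing space. Using m.group() is equivalent to data.substring(m.start(), m.end()) in the original loop.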

