How do I escape '+' in pattern matching to highlight a keyword?

I'm implementing a keyword highlighter in Java. I'm using java.util.regex.Pattern to highlight (make bold) a keyword within String content. The following code works fine for alphanumeric keywords, but not for keywords that contain certain special characters. For example, I would like to highlight the keyword c++, which contains the special character + (plus), but it is not highlighted properly. How do I escape the + character so that c++ is highlighted?

public static void main(String[] args)
{
    String content = "java,c++,ejb,struts,j2ee,hibernate";
    System.out.println("CONTENT: " + content);
    String highlight = "C++";

    System.out.println("HIGHLIGHT KEYWORD: " + highlight);

    //highlight = highlight.replaceAll(Pattern.quote("+"), "\\\\+");
    java.util.regex.Pattern pattern = java.util.regex.Pattern.compile("\\b" + highlight + "\\b", java.util.regex.Pattern.CASE_INSENSITIVE);
    System.out.println("PATTERN: " + pattern.pattern());
    java.util.regex.Matcher matcher = pattern.matcher(content);

    while (matcher.find()) {
        System.out.println("Match found!!!");
        for (int i = 0; i <= matcher.groupCount(); i++) {
            System.out.println(matcher.group(i));
            content = matcher.replaceAll("<B>" + matcher.group(i) + "</B>");
        }
    }
    System.out.println("RESULT: " + content);
}

Output:
CONTENT: java,c++,ejb,struts,j2ee,hibernate
HIGHLIGHT KEYWORD: C++
PATTERN: \bC++\b
Match found!!!
c
RESULT: java,<B>c</B>++,ejb,struts,j2ee,hibernate


I even tried to escape '+' before calling Pattern.compile, like this:

 highlight = highlight.replaceAll(Pattern.quote("+"), "\\\\+");

but I still can't get the syntax right. Can somebody help me solve this?

This should do what you need:

Pattern pattern = Pattern.compile(
    "\\b" 
    + Pattern.quote(highlight)
    + "\\b",
    Pattern.CASE_INSENSITIVE);

Update: you are right, the above doesn't work for C++ (\b matches word boundaries, and since ++ consists of non-word characters there is no word boundary after it). We need a more complicated solution:

Pattern pattern = Pattern.compile(
    "\\b" 
    + Pattern.quote(highlight)
    + "(?![^\\p{Punct}\\s])", // matches if the match is not followed by
                              // anything other than whitespace or punctuation
    Pattern.CASE_INSENSITIVE);
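
To see the difference between the two patterns above, here is a small sketch that runs both against the content string from the question. The class name BoundaryDemo and the helper found are made up for illustration; only Pattern.quote and the two regexes come from the answer.

```java
import java.util.regex.Pattern;

class BoundaryDemo {
    // Returns true if the regex finds at least one match in the input (case-insensitive).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input).find();
    }

    public static void main(String[] args) {
        String content = "java,c++,ejb,struts,j2ee,hibernate";
        String quoted = Pattern.quote("C++");

        // The trailing \b would have to sit between '+' and ',' which are both
        // non-word characters, so there is no boundary there and nothing matches.
        System.out.println(found("\\b" + quoted + "\\b", content)); // false

        // The negative lookahead only rejects a match that is followed by something
        // other than punctuation or whitespace, so "c++," is accepted.
        System.out.println(found("\\b" + quoted + "(?![^\\p{Punct}\\s])", content)); // true
    }
}
```

Because the lookahead is negative, it also succeeds at the very end of the input, where there is no following character at all.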

Update in response to comments: it seems that you need more logic in your pattern creation. Here's a helper method to create the pattern for you:

private static final String WORD_BOUNDARY = "\\b";
// edit this to suit your needs:
private static final String ALLOWED = "[^,.!\\-\\s]";
private static final String LOOKAHEAD = "(?!" + ALLOWED + ")";
private static final String LOOKBEHIND = "(?<!" + ALLOWED + ")";

public static Pattern createHighlightPattern(final String highlight) {
    final Pattern pattern = Pattern.compile(
            (Character.isLetterOrDigit(highlight.charAt(0)) 
             ? WORD_BOUNDARY : LOOKBEHIND)
            + Pattern.quote(highlight)
            + (Character.isLetterOrDigit(highlight.charAt(highlight.length() - 1))
             ? WORD_BOUNDARY : LOOKAHEAD),
            Pattern.CASE_INSENSITIVE);
    return pattern;
}

And here is some test code to check that it works:

private static void testMatch(final String haystack, final String needle) {
    final Matcher matcher = createHighlightPattern(needle).matcher(haystack);
    if (!matcher.find())
        System.out.println("Failed to find pattern " + needle);
    while (matcher.find())
        System.out.println("Found additional match: " + matcher.group() +
                           " for pattern " + needle);
}

public static void main(final String[] args) {
    final String testString = "java,c++,hibernate,.net,asp.net,c#,spring";
    testMatch(testString, "java");
    testMatch(testString, "c++");
    testMatch(testString, ".net");
    testMatch(testString, "c#");
}

When I run this method, I don't see any output (which is good :-)).
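
For completeness, here is one way the helper could be wired into the highlighter from the question. This is a sketch, not the answerer's code: the class name KeywordHighlighter, the method highlight, and the constant name NOT_A_SEPARATOR are assumptions, and the single replaceAll call with $0 stands in for the question's find/replace loop.

```java
import java.util.regex.Pattern;

class KeywordHighlighter {
    private static final String WORD_BOUNDARY = "\\b";
    // Any character that is NOT one of the separators , . ! - or whitespace.
    // A keyword edge made of punctuation must not touch such a character.
    private static final String NOT_A_SEPARATOR = "[^,.!\\-\\s]";

    static Pattern createHighlightPattern(String keyword) {
        boolean wordStart = Character.isLetterOrDigit(keyword.charAt(0));
        boolean wordEnd = Character.isLetterOrDigit(keyword.charAt(keyword.length() - 1));
        return Pattern.compile(
                (wordStart ? WORD_BOUNDARY : "(?<!" + NOT_A_SEPARATOR + ")")
                + Pattern.quote(keyword)
                + (wordEnd ? WORD_BOUNDARY : "(?!" + NOT_A_SEPARATOR + ")"),
                Pattern.CASE_INSENSITIVE);
    }

    // Wraps every occurrence in <B>...</B>. $0 refers to the matched text,
    // so the casing found in the content is preserved.
    static String highlight(String content, String keyword) {
        return createHighlightPattern(keyword).matcher(content).replaceAll("<B>$0</B>");
    }

    public static void main(String[] args) {
        String content = "java,c++,ejb,struts,j2ee,hibernate";
        // prints java,<B>c++</B>,ejb,struts,j2ee,hibernate
        System.out.println(highlight(content, "C++"));
    }
}
```

Using a single replaceAll avoids calling replaceAll inside a find loop, which resets the matcher while it is still being iterated.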

The problem is that the \b word boundary anchor does not match, because + is a non-word character and I assume it is followed by whitespace, which is also a non-word character.

A word boundary \b matches at a change from a word character (a member of \w) to a non-word character (not a member of \w), or vice versa.

Also, if you want to match a + literally, you have to escape it. As written, your pattern C++ means "match at least one C": the ++ is a possessive quantifier that matches one or more C and does not backtrack.
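
The following sketch illustrates how the unescaped pattern is interpreted and what escaping changes. The class PossessiveDemo and the helper firstMatch are hypothetical names used only for this demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Unescaped, "C++" is "one or more C, possessive", so it matches a run of C's.
        System.out.println(firstMatch("C++", "xCCCx"));              // CCC

        // Possessive means no backtracking: C++ takes every C and leaves none for the final C.
        System.out.println(firstMatch("C++C", "CCC"));               // null
        System.out.println(firstMatch("C+C", "CCC"));                // CCC (greedy, backtracks)

        // This is why the question's pattern only bolds the lone "c".
        System.out.println(firstMatch("\\bC++\\b", "java,c++,ejb")); // c

        // The replaceAll attempt from the question does produce a correctly escaped keyword.
        String escaped = "C++".replaceAll(Pattern.quote("+"), "\\\\+");
        System.out.println(escaped);                                 // C\+\+
        System.out.println(firstMatch(escaped, "java,c++,ejb"));     // c++
    }
}
```

So the escaping itself can be made to work; what still breaks the original pattern is the trailing \b after the plus signs.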

Try changing your pattern to something like this:

java.util.regex.Pattern.compile("\\b" + highlight + "(?=\\s)", java.util.regex.Pattern.CASE_INSENSITIVE);

(?=\s) is a positive lookahead that checks whether whitespace follows your highlight.

Additionally, you will need to escape the + you are searching for.
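
Putting both pieces of this answer together might look like the sketch below. It assumes Pattern.quote as the escaping mechanism (the answer does not name one), and the class and method names are invented for illustration. Note that a whitespace lookahead does not accept the comma-separated content from the question.

```java
import java.util.regex.Pattern;

class WhitespaceLookaheadDemo {
    // True if the escaped keyword occurs at a word boundary and is followed by whitespace.
    static boolean found(String content, String keyword) {
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(keyword) + "(?=\\s)",
                Pattern.CASE_INSENSITIVE);
        return pattern.matcher(content).find();
    }

    public static void main(String[] args) {
        System.out.println(found("I use c++ daily", "C++")); // true: a space follows
        System.out.println(found("java,c++,ejb", "C++"));    // false: a comma follows
        System.out.println(found("I use c++", "C++"));       // false: nothing follows
    }
}
```

A positive lookahead needs an actual character to inspect, which is why the keyword at the end of the input is not matched here.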

All you need is here:

Pattern.compile("\\Q"+highlight+"\\E", java.util.regex.Pattern.CASE_INSENSITIVE);
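
As a short illustration, \Q...\E is the same quoting that Pattern.quote produces for simple keywords, but Pattern.quote is the safer choice if a keyword could itself contain \E. The class QuoteDemo and the helper found are hypothetical names; the pattern has no boundary checks, so it also matches inside longer tokens.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Returns true if the regex finds at least one match in the input (case-insensitive).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input).find();
    }

    public static void main(String[] args) {
        // For a keyword without "\E", Pattern.quote simply wraps it in \Q and \E.
        System.out.println(Pattern.quote("C++"));                     // \QC++\E

        // Manual quoting finds the literal text "c++".
        System.out.println(found("\\Q" + "C++" + "\\E", "java,c++,ejb")); // true

        // No boundaries are checked, so a longer token containing "c++" matches as well.
        System.out.println(found(Pattern.quote("C++"), "objc++x"));   // true

        // Pattern.quote also copes with a keyword that contains the two characters \E.
        System.out.println(found(Pattern.quote("a\\Eb"), "xa\\Eby")); // true
    }
}
```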

Assuming your keyword does not begin or end with punctuation, here is a commented regex which uses lookahead and lookbehind to achieve your desired matching behavior:

// Compile regex to match a keyword or keyphrase.
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile(
    "(?<=[\\s'\".?!,;:]|^)  # Word preceded by ws, quote, punct or BOS.\n" +

    // Escape any regex metacharacters in the keyword phrase.
    java.util.regex.Pattern.quote(highlight) + " # Keyword to be matched.\n" +

    "(?=[\\s'\".?!,;:]|$)   # Word followed by ws, quote, punct or EOS.", 
    Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.COMMENTS);

Note that this solution works even if your keyword is a phrase containing spaces.
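
A compact variant of the same idea is sketched below, written without the COMMENTS flag so that the pattern fits on one line. The class name PhraseHighlighter, the constant SEPARATOR, and the method highlight are assumptions made for this example; the separator set is the one used in the lookarounds above.

```java
import java.util.regex.Pattern;

class PhraseHighlighter {
    // Characters allowed directly before or after the keyword:
    // whitespace, quotes, and common punctuation.
    private static final String SEPARATOR = "[\\s'\".?!,;:]";

    static String highlight(String content, String keyword) {
        Pattern pattern = Pattern.compile(
                "(?<=" + SEPARATOR + "|^)"      // preceded by a separator or start of input
                + Pattern.quote(keyword)         // the keyword, taken literally
                + "(?=" + SEPARATOR + "|$)",     // followed by a separator or end of input
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        // $0 is the matched text, so the original casing is kept inside the bold tags.
        return pattern.matcher(content).replaceAll("<B>$0</B>");
    }

    public static void main(String[] args) {
        // prints I know C++ and <B>Visual Basic</B>, but not basic.
        System.out.println(highlight("I know C++ and Visual Basic, but not basic.", "visual basic"));
        // prints java,<B>c++</B>,ejb
        System.out.println(highlight("java,c++,ejb", "C++"));
    }
}
```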
