How do I escape '+' in pattern matching to highlight keyword?
I'm implementing a keyword highlighter in Java. I'm using java.util.regex.Pattern to highlight (make bold) a keyword within String content. The following code works fine for alphanumeric keywords, but not for keywords that contain certain special characters. For example, I would like to highlight the keyword c++, which contains the special character + (plus), but it is not highlighted properly. How do I escape the + character so that c++ is highlighted?
public static void main(String[] args)
{
    String content = "java,c++,ejb,struts,j2ee,hibernate";
    System.out.println("CONTENT: " + content);
    String highlight = "C++";
    System.out.println("HIGHLIGHT KEYWORD: " + highlight);
    //highlight = highlight.replaceAll(Pattern.quote("+"), "\\\\+");
    java.util.regex.Pattern pattern = java.util.regex.Pattern.compile("\\b" + highlight + "\\b", java.util.regex.Pattern.CASE_INSENSITIVE);
    System.out.println("PATTERN: " + pattern.pattern());
    java.util.regex.Matcher matcher = pattern.matcher(content);
    while (matcher.find()) {
        System.out.println("Match found!!!");
        for (int i = 0; i <= matcher.groupCount(); i++) {
            System.out.println(matcher.group(i));
            content = matcher.replaceAll("<B>" + matcher.group(i) + "</B>");
        }
    }
    System.out.println("RESULT: " + content);
}
Output:
CONTENT: java,c++,ejb,struts,j2ee,hibernate
HIGHLIGHT KEYWORD: C++
PATTERN: \bC++\b
Match found!!!
c
RESULT: java,<B>c</B>++,ejb,struts,j2ee,hibernate
I also tried escaping the + before compiling the pattern:
highlight = highlight.replaceAll(Pattern.quote("+"), "\\\\+");
but I still can't get the syntax right. Can somebody help me solve this?
This should do what you need:
Pattern pattern = Pattern.compile(
"\\b"
+ Pattern.quote(highlight)
+ "\\b",
Pattern.CASE_INSENSITIVE);
Update: you are right, the above doesn't work for C++ (\b matches only at word boundaries, and ++ consists of non-word characters, so there is no boundary after it when another non-word character follows). We need a more complicated solution:
Pattern pattern = Pattern.compile(
"\\b"
+ Pattern.quote(highlight)
+ "(?![^\\p{Punct}\\s])", // matches if the match is not followed by
// anything other than whitespace or punctuation
Pattern.CASE_INSENSITIVE);
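To see the difference between the two patterns above, here is a minimal sketch that runs both against the sample string from the question. The class name BoundaryCheck and the found helper are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

class BoundaryCheck {
    // Illustrative helper: true if the regex matches anywhere in the content.
    static boolean found(String regex, String content) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(content).find();
    }

    public static void main(String[] args) {
        String content = "java,c++,ejb,struts,j2ee,hibernate";
        String quoted = Pattern.quote("C++");

        // "+" and "," are both non-word characters, so there is no \b after "c++".
        System.out.println(found("\\b" + quoted + "\\b", content)); // prints false

        // The negative lookahead accepts punctuation, whitespace or end of input instead.
        System.out.println(found("\\b" + quoted + "(?![^\\p{Punct}\\s])", content)); // prints true
    }
}
```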
Update in response to comments: it seems that you need more logic in your pattern creation. Here's a helper method to create the pattern for you:
private static final String WORD_BOUNDARY = "\\b";
// edit this to suit your needs:
private static final String ALLOWED = "[^,.!\\-\\s]";
private static final String LOOKAHEAD = "(?!" + ALLOWED + ")";
private static final String LOOKBEHIND = "(?<!" + ALLOWED + ")";
public static Pattern createHighlightPattern(final String highlight) {
final Pattern pattern = Pattern.compile(
(Character.isLetterOrDigit(highlight.charAt(0))
? WORD_BOUNDARY : LOOKBEHIND)
+ Pattern.quote(highlight)
+ (Character.isLetterOrDigit(highlight.charAt(highlight.length() - 1))
? WORD_BOUNDARY : LOOKAHEAD),
Pattern.CASE_INSENSITIVE);
return pattern;
}
And here is some test code to check that it works:
private static void testMatch(final String haystack, final String needle) {
final Matcher matcher = createHighlightPattern(needle).matcher(haystack);
if (!matcher.find())
System.out.println("Failed to find pattern " + needle);
while (matcher.find())
System.out.println("Found additional match: " + matcher.group() +
" for pattern " + needle);
}
public static void main(final String[] args) {
final String testString = "java,c++,hibernate,.net,asp.net,c#,spring";
testMatch(testString, "java");
testMatch(testString, "c++");
testMatch(testString, ".net");
testMatch(testString, "c#");
}
When I run this method, I don't see any output (which is good :-))
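For the actual highlighting step, one possible approach is a single replaceAll call with "$0" (the matched text) in the replacement, rather than calling replaceAll inside a find() loop. The sketch below repeats the pattern-building logic from createHighlightPattern so that it compiles on its own; the class name Highlighter and the highlight method are hypothetical additions.

```java
import java.util.regex.Pattern;

class Highlighter {
    private static final String WORD_BOUNDARY = "\\b";
    private static final String ALLOWED = "[^,.!\\-\\s]";
    private static final String LOOKAHEAD = "(?!" + ALLOWED + ")";
    private static final String LOOKBEHIND = "(?<!" + ALLOWED + ")";

    // Same logic as createHighlightPattern above, repeated so this compiles alone.
    static Pattern createHighlightPattern(String highlight) {
        boolean wordStart = Character.isLetterOrDigit(highlight.charAt(0));
        boolean wordEnd = Character.isLetterOrDigit(highlight.charAt(highlight.length() - 1));
        return Pattern.compile(
                (wordStart ? WORD_BOUNDARY : LOOKBEHIND)
                + Pattern.quote(highlight)
                + (wordEnd ? WORD_BOUNDARY : LOOKAHEAD),
                Pattern.CASE_INSENSITIVE);
    }

    // Hypothetical helper: wraps every match in <B>...</B>.
    // "$0" refers to the matched text, so the original casing is preserved.
    static String highlight(String content, String keyword) {
        return createHighlightPattern(keyword).matcher(content).replaceAll("<B>$0</B>");
    }

    public static void main(String[] args) {
        String content = "java,c++,ejb,struts,j2ee,hibernate";
        System.out.println(highlight(content, "C++"));
        // prints java,<B>c++</B>,ejb,struts,j2ee,hibernate
    }
}
```

Because the replacement string is a fixed template here, no extra escaping of the replacement is needed; only the keyword is user input, and Pattern.quote takes care of it.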
The problem is that the \b word boundary anchor does not match, because + is a non-word character and, I assume, it is followed by whitespace, which is also a non-word character.
A word boundary \b matches at a transition between a word character (a member of \w) and a non-word character (not a member of \w), in either direction.
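A small sketch of that rule, using a fragment of the question's content. The class name WordBoundaryDemo and the found helper are illustrative only.

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Illustrative helper: true if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "c++,ejb";

        // "c" is a word character and "+" is not: a boundary sits between them.
        System.out.println(found("c\\b\\+", input));  // prints true

        // "+" and "," are both non-word characters: no boundary between them.
        System.out.println(found("\\+\\b,", input));  // prints false

        // "," is a non-word character and "e" is a word character: boundary again.
        System.out.println(found(",\\be", input));    // prints true
    }
}
```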
Also, if you want to match a + literally, you have to escape it. Here you are searching for C++, which as a regex means: match at least one C. The ++ is a possessive quantifier that matches one or more C and does not backtrack.
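The following sketch shows how the unescaped C++ behaves compared with an escaped version. The class name PossessiveDemo and the firstMatch helper are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveDemo {
    // Illustrative helper: returns the first match of the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Unescaped: "C++" is "one or more C, possessively", so the plus signs are never matched.
        System.out.println(firstMatch("C++", "ccc++"));                 // prints ccc

        // Escaped by hand or with Pattern.quote: the literal text "c++" is matched.
        System.out.println(firstMatch("C\\+\\+", "ccc++"));             // prints c++
        System.out.println(firstMatch(Pattern.quote("C++"), "ccc++"));  // prints c++

        // Possessive means no backtracking: "C++" consumes every c and gives none back.
        System.out.println(firstMatch("C++c", "ccc"));                  // prints null
        System.out.println(firstMatch("C+c", "ccc"));                   // prints ccc
    }
}
```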
Try changing your pattern to something like this:
java.util.regex.Pattern.compile("\\b" + highlight + "(?=\\s)", java.util.regex.Pattern.CASE_INSENSITIVE);
(?=\s) is a positive lookahead that checks whether whitespace follows your highlight.
Additionally, you will need to escape the + you are searching for.
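Putting both points together, a minimal sketch could look like the following. It assumes Pattern.quote is used for the escaping (the answer does not say how to escape), and the class name LookaheadDemo and the keywordPattern method are hypothetical. Note that this lookahead only accepts whitespace after the keyword, so it does not match in the comma-separated string from the question.

```java
import java.util.regex.Pattern;

class LookaheadDemo {
    // Hypothetical helper: word boundary before, quoted keyword, whitespace required after.
    static Pattern keywordPattern(String highlight) {
        return Pattern.compile(
                "\\b" + Pattern.quote(highlight) + "(?=\\s)",
                Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        // Whitespace follows "c++", so the lookahead succeeds.
        System.out.println(keywordPattern("C++").matcher("java c++ ejb").find());  // prints true

        // A comma follows "c++", so the lookahead fails.
        System.out.println(keywordPattern("C++").matcher("java,c++,ejb").find());  // prints false
    }
}
```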
All you need is here:
Pattern.compile("\\Q"+highlight+"\\E", java.util.regex.Pattern.CASE_INSENSITIVE);
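For comparison, Pattern.quote produces the same \Q...\E wrapping for ordinary keywords, and it also copes with a keyword that itself contains \E, where manual concatenation would end the quoted section early. The sketch below illustrates this; the class name QuoteDemo and the matchesLiterally helper are made up for the example. Keep in mind that a pattern consisting only of the quoted keyword has no boundary checks, so it also matches the keyword inside longer tokens.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Illustrative helper: true if the quoted keyword matches exactly its own text.
    static boolean matchesLiterally(String keyword) {
        return Pattern.compile(Pattern.quote(keyword)).matcher(keyword).matches();
    }

    public static void main(String[] args) {
        // For a keyword without "\E", quote() simply wraps it in \Q...\E.
        System.out.println(Pattern.quote("c++"));        // prints \Qc++\E

        System.out.println(matchesLiterally("c++"));     // prints true

        // A keyword containing the two characters \E is still treated as one literal.
        System.out.println(matchesLiterally("a\\E+"));   // prints true
    }
}
```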
Assuming your keyword does not begin or end with punctuation, here is a commented regex which uses lookahead and lookbehind to achieve your desired matching behavior:
// Compile regex to match a keyword or keyphrase.
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile(
"(?<=[\\s'\".?!,;:]|^) # Word preceded by ws, quote, punct or BOS.\n" +
// Escape any regex metacharacters in the keyword phrase.
java.util.regex.Pattern.quote(highlight) + " # Keyword to be matched.\n" +
"(?=[\\s'\".?!,;:]|$) # Word followed by ws, quote, punct or EOS.",
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.COMMENTS);
Note that this solution works even if your keyword is a phrase containing spaces.
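As a rough usage sketch, the same lookbehind and lookahead can be combined with replaceAll to do the highlighting. This version leaves out the COMMENTS flag and the inline regex comments to keep the example short; the class name PhraseHighlighter, the DELIMS constant and the highlight method are hypothetical names, not part of the answer above.

```java
import java.util.regex.Pattern;

class PhraseHighlighter {
    // Whitespace, quotes and common punctuation, as in the regex above.
    private static final String DELIMS = "[\\s'\".?!,;:]";

    // Hypothetical helper: wraps each delimited occurrence of the keyword in <B>...</B>.
    static String highlight(String content, String keyword) {
        Pattern pattern = Pattern.compile(
                "(?<=" + DELIMS + "|^)"        // preceded by a delimiter or start of input
                + Pattern.quote(keyword)       // keyword with metacharacters escaped
                + "(?=" + DELIMS + "|$)",      // followed by a delimiter or end of input
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return pattern.matcher(content).replaceAll("<B>$0</B>");
    }

    public static void main(String[] args) {
        String content = "I use Spring Boot and c++.";
        System.out.println(highlight(content, "spring boot"));
        // prints I use <B>Spring Boot</B> and c++.
        System.out.println(highlight(content, "C++"));
        // prints I use Spring Boot and <B>c++</B>.
    }
}
```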