Highlight Words from a Regex Match

I am trying to search a paragraph for certain text with Regex. I'd like the result to return X number of words before and after the match, and to add highlighting around all occurrences of the search text.

For example: consider the following paragraph. The result should have at least 10 characters before and after the match, with no words cut off. The search term is "dog".

The Dog is a pet animal. It is one of the most obedient animals. There are many kinds of dogs in the world. Some of the are very friendly while some of them a dangerous. Dogs are of different color like black, red, white and brown. Some old them have slippery shiny skin and some have rough skin. Dogs are carnivorous animals. They like eating meat. They have four legs, two ears and a tail. Dogs are trained to perform different tasks. They protect us from thieves b) guarding our house. They are loving animals. A dog is called man's best friend. They are used by the police to find hidden things. They are one of the most useful animals in the world. Doggonit!

The result I desire is an array that looks like the following:

  • The <strong>Dog</strong> is a pet animal
  • many kinds of <strong>dog</strong>s in the world
  • dangerous. <strong>Dog</strong>s are of different
  • rough skin. <strong>Dog</strong>s are carnivorous
  • and a tail. <strong>Dog</strong>s are trained
  • animals. A <strong>dog</strong> is called
  • the world. <strong>Dog</strong>gonit!

What I've Got:

I've searched around and found the following regex, which returns the results exactly as desired but without adding any extra formatting. I created several methods to handle each piece of functionality:

private List<List<string>> Search(string text, string searchTerm, int sizeOfResult, bool searchEntireWord) {
    var result = new List<List<string>>();
    // Treat each space-separated word as a separate search term
    var searchTerms = searchTerm.Split(' ');
    foreach (var word in searchTerms) {
        var searchResults = ExtractParagraph(text, word, sizeOfResult, searchEntireWord);
        result.Add(searchResults);
        foreach (var searchResult in searchResults) {
            Response.Write("<strong>Result:</strong> " + searchResult + "<br>");
        }
    }
    return result;
}

private List<string> ExtractParagraph(string text, string searchTerm, int sizeOfResult, bool searchEntireWord) {
    var result = new List<string>();
    searchTerm = searchEntireWord ? @"\b" + searchTerm + @"\b" : searchTerm;
    //var expression = @"((^.{0,30}|\w*.{30})\b" + searchTerm + @"\b(.{30}\w*|.{0,30}$))";
    var expression = @"((^.{0," + sizeOfResult + @"}|\w*.{" + sizeOfResult + @"})" + searchTerm + @"(.{" + sizeOfResult + @"}\w*|.{0," + sizeOfResult + @"}$))";
    var wordMatch = new Regex(expression, RegexOptions.IgnoreCase | RegexOptions.Singleline);

    foreach (Match m in wordMatch.Matches(text)) {
        result.Add(m.Value);
    }
    return result;
}

And I can call it like this:

var text = "The Dog is a pet animal. It is one of...";
var searchResults = Search(text, "dog", 10, false);
if (searchResults.Count > 0) {
    foreach (var searchResult in searchResults) {
        foreach (var result in searchResult) {
            Response.Write("<strong>Result:</strong> " + result + "<br>");
        }
    }
}

I don't yet know what happens with, or how to deal with, multiple occurrences of the word within the 10 characters, e.g. if a sentence were "A dog is a dog of course!". I guess I can deal with that later.

Tests:

var searchResults = Search(text, "dog", 0, false); // should include only the matched word
var searchResults = Search(text, "dog", 1, false); // should include the matched word and only one word preceding and following the matched word (if any)
var searchResults = Search(text, "dog", 10, false); // should include the matched word and up to 10 characters (but not cutting off words in the middle) preceding and following it (if any)
var searchResults = Search(text, "dog", 50, false); // should include the matched word and up to 50 characters (but not cutting off words in the middle) preceding and following it (if any)

Issues:

The function I created allows the search to find the searchTerm either as a whole word only or as part of a word.

What I was doing was a simple Replace(word, "<strong>" + word + "</strong>") on the results when displaying them. This works great if I am searching for parts of a word. But when searching for whole words, if the result also includes the searchTerm as part of another word, that part of the word gets highlighted too.

For example: if I was searching for "dog" and the result was "All dogs go to dog heaven.", the highlighting would come out as "All <strong>dog</strong>s go to <strong>dog</strong> heaven." But I want "All dogs go to <strong>dog</strong> heaven."
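The behaviour described above can be reproduced with a minimal sketch. The class and method names here (NaiveHighlight, Highlight) are made up for illustration; only string.Replace comes from the description.

```csharp
public static class NaiveHighlight
{
    // Plain substring replacement: it has no notion of word boundaries,
    // so every occurrence of the term is wrapped, including those inside longer words.
    public static string Highlight(string result, string word)
    {
        return result.Replace(word, "<strong>" + word + "</strong>");
    }
}
```

Calling NaiveHighlight.Highlight("All dogs go to dog heaven.", "dog") returns "All <strong>dog</strong>s go to <strong>dog</strong> heaven.", which wraps the "dog" inside "dogs" as well. string.Replace is also case-sensitive, so it would miss "Dog" entirely.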

Question:

The question is: how can I get the matched word wrapped with some HTML like <strong> or anything else I'd want?

Your solution should be able to do two main things: 1) extract the matches, i.e. keywords/phrases plus additional left- and right-hand contexts around them, and 2) wrap the search terms with tags.

The extraction regex (for, say, 10 chars on the left and right) is

(?si)(?<!\S).{0,10}(?<!\S)\S*dog\S*(?!\S).{0,10}(?!\S)

See the regex demo.

Details

  • (?si) - enable the Singleline and IgnoreCase modifiers (. will match all chars, including newlines, and the pattern will be case insensitive)
  • (?<!\S) - a left-hand whitespace boundary
  • .{0,10} - 0 to 10 chars
  • (?<!\S) - a left-hand whitespace boundary
  • \S*dog\S* - dog with any 0+ non-whitespace chars around it (NOTE: if searchEntireWord is true, you need to remove \S* from this pattern part)
  • (?!\S) - a right-hand whitespace boundary
  • .{0,10} - 0 to 10 chars
  • (?!\S) - a right-hand whitespace boundary.

In C#, it will be defined as

var expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S)\S*{1}\S*(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm)); 
if (searchEntireWord) { 
    expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S){1}(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm)); 
} 

Note that in the format string, {{ produces a literal { and }} produces a literal }.
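To see the brace escaping in isolation, here is a small sketch that only builds the pattern string. The helper name BuildExtractionPattern and the class PatternBuilder are hypothetical; the format string itself is the one shown above.

```csharp
using System.Text.RegularExpressions;

public static class PatternBuilder
{
    // Builds the extraction pattern for partial-word matching.
    // {{ and }} become literal braces; {0} is the context size, {1} the escaped term.
    public static string BuildExtractionPattern(string searchTerm, int sizeOfResult)
    {
        return string.Format(
            @"(?si)(?<!\S).{{0,{0}}}(?<!\S)\S*{1}\S*(?!\S).{{0,{0}}}(?!\S)",
            sizeOfResult, Regex.Escape(searchTerm));
    }
}
```

PatternBuilder.BuildExtractionPattern("dog", 10) yields (?si)(?<!\S).{0,10}(?<!\S)\S*dog\S*(?!\S).{0,10}(?!\S). Because the term goes through Regex.Escape, a search term such as "c++" is inserted as c\+\+ rather than being interpreted as regex syntax.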

The second regex, which wraps the key terms with strong tags, is much simpler:

Regex.Replace(x.Value, 
            searchEntireWord ? 
                string.Format(@"(?i)(?<!\S){0}(?!\S)", Regex.Escape(searchTerm)) : 
                string.Format(@"(?i){0}", Regex.Escape(searchTerm)), 
            "<strong>$&</strong>")

Note that $& in the replacement pattern refers to the whole match value.
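As a quick illustration of $&, the whole-word variant of the replacement can be wrapped in a helper. The names HighlightDemo and WrapWholeWord are assumptions made for this sketch.

```csharp
using System.Text.RegularExpressions;

public static class HighlightDemo
{
    // Wraps whole-word occurrences (delimited by whitespace or string edges) in <strong> tags.
    // $& re-inserts the matched text itself, so the original casing is preserved.
    public static string WrapWholeWord(string input, string searchTerm)
    {
        var pattern = string.Format(@"(?i)(?<!\S){0}(?!\S)", Regex.Escape(searchTerm));
        return Regex.Replace(input, pattern, "<strong>$&</strong>");
    }
}
```

HighlightDemo.WrapWholeWord("The Dog and his dogs", "dog") returns "The <strong>Dog</strong> and his dogs": the capital D comes from the input, not from the search term, and "dogs" is left alone.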

C# code:

public static List<string> ExtractTexts(string text, string searchTerm, int sizeOfResult, bool searchEntireWord) 
{
    var expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S)\S*{1}\S*(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm)); 
    if (searchEntireWord) { 
        expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S){1}(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm)); 
    } 
    return Regex.Matches(text, expression) 
        .Cast<Match>() 
        .Select(x => Regex.Replace(x.Value, 
            searchEntireWord ? 
                string.Format(@"(?i)(?<!\S){0}(?!\S)", Regex.Escape(searchTerm)) : 
                string.Format(@"(?i){0}", Regex.Escape(searchTerm)), 
            "<strong>$&</strong>"))
        .ToList();
}

Sample usage (see demo):

var text = "The Dog is a real-pet animal. There's an undogging dog that only undogs non-dogs. It is one of the most obedient animals. There are many kinds of dogs in the world. Some of the are very friendly while some of them a dangerous. Dogs are of different color like black, red, white and brown. Some old them have slippery shiny skin and some have rough skin. Dogs are carnivorous animals. They like eating meat. They have four legs, two ears and a tail. Dogs are trained to perform different tasks. They protect us from thieves b) guarding our house. They are loving animals. A dog is called man's best friend. They are used by the police to find hidden things. They are one of the most useful animals in the world. Doggonit!";
var searchTerm = "dog";
var searchEntireWord = false;
Console.WriteLine("======= 10 ========");
var results = ExtractTexts(text, searchTerm, 10, searchEntireWord);
foreach (var result in results)
    Console.WriteLine(result);

Output:

======= 10 ========
(?si)(?<!\S).{0,10}(?<!\S)\S*dog\S*(?!\S).{0,10}(?!\S)
The <strong>Dog</strong> is a
an un<strong>dog</strong>ging <strong>dog</strong> that
only un<strong>dog</strong>s non-<strong>dog</strong>s.
kinds of <strong>dog</strong>s in the
<strong>Dog</strong>s are of
skin. <strong>Dog</strong>s are
a tail. <strong>Dog</strong>s are
A <strong>dog</strong> is called
world. <strong>Dog</strong>gonit!

Another example:

Console.WriteLine("======= 15 ========");
results = ExtractTexts(text, searchTerm, 15, searchEntireWord);
foreach (var result in results)
    Console.WriteLine(result);

Output:

======= 15 ========
(?si)(?<!\S).{0,15}(?<!\S)\S*dog\S*(?!\S).{0,15}(?!\S)
The <strong>Dog</strong> is a real-pet
There's an un<strong>dog</strong>ging <strong>dog</strong> that only
un<strong>dog</strong>s non-<strong>dog</strong>s. It is one of
many kinds of <strong>dog</strong>s in the world.
a dangerous. <strong>Dog</strong>s are of
rough skin. <strong>Dog</strong>s are
and a tail. <strong>Dog</strong>s are trained to
animals. A <strong>dog</strong> is called
in the world. <strong>Dog</strong>gonit!
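Both samples above run with searchEntireWord set to false. The following self-contained sketch repeats the ExtractTexts method unchanged inside a wrapper class (ExtractDemo is a name chosen here for illustration) so that the whole-word branch can be tried on the sentence from the question.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ExtractDemo
{
    public static List<string> ExtractTexts(string text, string searchTerm, int sizeOfResult, bool searchEntireWord)
    {
        // Partial-word pattern: the term may be surrounded by other non-whitespace chars.
        var expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S)\S*{1}\S*(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm));
        if (searchEntireWord)
        {
            // Whole-word pattern: the term must sit between whitespace boundaries.
            expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S){1}(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm));
        }
        return Regex.Matches(text, expression)
            .Cast<Match>()
            // Second pass: wrap the term inside each extracted snippet.
            .Select(x => Regex.Replace(x.Value,
                searchEntireWord ?
                    string.Format(@"(?i)(?<!\S){0}(?!\S)", Regex.Escape(searchTerm)) :
                    string.Format(@"(?i){0}", Regex.Escape(searchTerm)),
                "<strong>$&</strong>"))
            .ToList();
    }
}
```

With the input "All dogs go to dog heaven.", the term "dog" and a size of 10, the whole-word call (searchEntireWord = true) returns the single snippet "go to <strong>dog</strong> heaven.", while the partial-word call returns "All <strong>dog</strong>s go to <strong>dog</strong>". The second result also shows what happens when the term occurs twice within one window: both occurrences land in the same snippet and both are wrapped.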

A simple solution using Regex.Replace:

public bool HighlightExactMatchOnly(string input, string textToHighlight, string expected)
{
    // given
    var escapedHighlight = Regex.Escape(textToHighlight);

    // when
    var result = Regex.Replace(input, @"\b" + escapedHighlight + @"\b", "<strong>$0</strong>");

    return expected == result;
}

Test:

var text = "My test dogs with a single dog and some text behind";
var expected = "My test dogs with a single <strong>dog</strong> and some text behind";
HighlightExactMatchOnly(text , "dog", expected);

Please note that this is not the fastest possible solution.
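Speed aside, the \b word boundaries used here do not behave the same way as the whitespace boundaries (?<!\S) and (?!\S) from the other answer. The sketch below puts the two side by side; the class and method names are invented for the comparison.

```csharp
using System.Text.RegularExpressions;

public static class BoundaryComparison
{
    // \b matches between a word char and a non-word char, so punctuation and hyphens count as boundaries.
    public static string WithWordBoundaries(string input, string term)
    {
        return Regex.Replace(input, @"\b" + Regex.Escape(term) + @"\b", "<strong>$0</strong>");
    }

    // Whitespace boundaries require whitespace (or a string edge) on both sides of the term.
    public static string WithWhitespaceBoundaries(string input, string term)
    {
        return Regex.Replace(input, @"(?<!\S)" + Regex.Escape(term) + @"(?!\S)", "<strong>$0</strong>");
    }
}
```

For the input "A dog, a non-dog and dogs." and the term "dog", WithWordBoundaries returns "A <strong>dog</strong>, a non-<strong>dog</strong> and dogs." whereas WithWhitespaceBoundaries returns the input unchanged, because every "dog" touches a comma, a hyphen or a letter. Which behaviour counts as a "whole word" depends on how punctuation should be treated in your text.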
