Highlight Words from a Regex Match

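I'm trying to search for certain text in a paragraph using Regex. I want the results to return X words before and after the match, and to add highlighting around every occurrence of the text.

For example: consider the following paragraph. The result should include at least 10 characters before and after the match, without cutting off words. The search term is "dog".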

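The Dog is a pet animal. It is one of the most obedient animals. There are many kinds of dogs in the world. Some of the are very friendly while some of them a dangerous. Dogs are of different color like black, red, white and brown. Some old them have slippery shiny skin and some have rough skin. Dogs are carnivorous animals. They like eating meat. They have four legs, two ears and a tail. Dogs are trained to perform different tasks. They protect us from thieves b) guarding our house. They are loving animals. A dog is called man's best friend. They are used by the police to find hidden things. They are one of the most useful animals in the world. Doggonit!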

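The result I want is an array like the following:

  • The <strong>Dog</strong> is a pet
  • many kinds of <strong>dog</strong>s in the world
  • a dangerous. <strong>Dog</strong>s are of different
  • rough skin. <strong>Dog</strong>s are carnivorous
  • and a tail. <strong>Dog</strong>s are trained
  • animals. A <strong>dog</strong> is called
  • the world. <strong>Dog</strong>gonit!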

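What I've got:

I searched and found the following regex, which returns exactly the results I need but without the extra formatting. I created a couple of methods to handle each part: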

// Runs ExtractParagraph once per space-separated search term and collects the snippets.
private List<List<string>> Search(string text, string searchTerm, int sizeOfResult, bool searchEntireWord) {
    var result = new List<List<string>>();
    var searchTerms = searchTerm.Split(' ');
    foreach (var word in searchTerms) {
        var searchResults = ExtractParagraph(text, word, sizeOfResult, searchEntireWord);
        result.Add(searchResults);
        if (searchResults.Count > 0) {
            foreach (var searchResult in searchResults) {
                Response.Write("<strong>Result:</strong> " + searchResult + "<br>");
            }
        }
    }
    return result;
}

// Returns each occurrence of searchTerm with up to sizeOfResult characters of context on either side.
private List<string> ExtractParagraph(string text, string searchTerm, int sizeOfResult, bool searchEntireWord) {
    var result = new List<string>();
    searchTerm = searchEntireWord ? @"\b" + searchTerm + @"\b" : searchTerm;
    //var expression = @"((^.{0,30}|\w*.{30})\b" + searchTerm + @"\b(.{30}\w*|.{0,30}$))";
    var expression = @"((^.{0," + sizeOfResult + @"}|\w*.{" + sizeOfResult + @"})" + searchTerm + @"(.{" + sizeOfResult + @"}\w*|.{0," + sizeOfResult + @"}$))";
    var wordMatch = new Regex(expression, RegexOptions.IgnoreCase | RegexOptions.Singleline);

    foreach (Match m in wordMatch.Matches(text)) {
        result.Add(m.Value);
    }
    return result;
}

I can call it like this:

var text = "The Dog is a pet animal. It is one of...";
var searchResults = Search(text, "dog", 10, false);
if (searchResults.Count > 0) {
    foreach (var searchResult in searchResults) {
        foreach (var result in searchResult) {
            Response.Write("<strong>Result:</strong> " + result + "<br>");
        }
    }
}

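I don't know yet what the result will be, or how to handle it, when the word occurs several times within 10 characters, e.g. in a sentence like "A dog is a dog, of course!". I think I can deal with that later.

Tests: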

var searchResults = Search(text, "dog", 0, false); // should include only the matched word
var searchResults = Search(text, "dog", 1, false); // should include the matched word and only one word preceding and following the matched word (if any)
var searchResults = Search(text, "dog", 10, false); // should include the matched word and up to 10 characters (but not cutting off words in the middle) preceding and following it (if any)
var searchResults = Search(text, "dog", 50, false); // should include the matched word and up to 50 characters (but not cutting off words in the middle) preceding and following it (if any)

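The problem:

The function I created lets the search find searchTerm either as a whole word or as part of a word.

What I'm doing is a simple Replace(word, "<strong>" + word + "</strong>") on each result when displaying it. That works well when I'm searching for part of a word. But when searching for a whole word, if the result also contains searchTerm as part of another word, that part of the word gets highlighted too.

For example: if I search for "dog" and the result is "All dogs go to dog heaven.", the highlighted output is "All <strong>dog</strong>s go to <strong>dog</strong> heaven." But I want "All dogs go to <strong>dog</strong> heaven."

Question:

How can I wrap the matched word in some HTML, such as <strong> or anything else I want?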

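Your solution should be able to do two things: 1) extract the matches, i.e. the keywords/phrases together with additional left-hand and right-hand context around them, and 2) wrap the search terms with a tag.

The extraction regex (for, e.g., 10 characters on the left and on the right) is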

(?si)(?<!\S).{0,10}(?<!\S)\S*dog\S*(?!\S).{0,10}(?!\S)

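See the regex demo.

Details

  • (?si) - enables the Singleline and IgnoreCase modifiers (. matches any character, including a newline, and the pattern is case-insensitive)
  • (?<!\S) - a left-hand whitespace boundary (the position is at the start of the string or right after a whitespace character)
  • .{0,10} - 0 to 10 characters
  • (?<!\S) - a left-hand whitespace boundary
  • \S*dog\S* - dog with any 0+ non-whitespace characters around it. Note: if searchEntireWord is true, you need to remove \S* from this part of the pattern
  • (?!\S) - a right-hand whitespace boundary (the position is at the end of the string or right before a whitespace character)
  • .{0,10} - 0 to 10 characters
  • (?!\S) - a right-hand whitespace boundary.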
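
To see how the pieces listed above work together, here is a minimal sketch that runs the same pattern on a short string. The class name ExtractionSketch, the method name Extract, and the sample sentence are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ExtractionSketch
{
    // Hypothetical helper: returns every occurrence of searchTerm (as part of a word)
    // with up to sizeOfResult characters of context on each side, never cutting a word.
    public static List<string> Extract(string text, string searchTerm, int sizeOfResult)
    {
        // Context windows are bounded by whitespace boundaries, so they can only
        // start and end between words.
        var pattern = @"(?si)(?<!\S).{0," + sizeOfResult + @"}(?<!\S)\S*"
                      + Regex.Escape(searchTerm)
                      + @"\S*(?!\S).{0," + sizeOfResult + @"}(?!\S)";

        return Regex.Matches(text, pattern)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```

With a window of 6, Extract("one two dog three four", "dog", 6) returns the single snippet "two dog three". The left window has room for 6 characters, but taking more than "two " would cut into "one", so the boundary check forces it to stop at the word edge.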

In C#, it can be defined as

var expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S)\S*{1}\S*(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm)); 
if (searchEntireWord) { 
    expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S){1}(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm)); 
} 

Note that {{ in the format string is a literal { and }} is a literal }.
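
As a small illustration of that escaping rule, the sketch below builds only the quantifier part of the pattern. The names FormatBraces and ContextQuantifier are made up for this example.

```csharp
using System;

public static class FormatBraces
{
    // Hypothetical helper: builds the ".{0,N}" quantifier used for the context window.
    public static string ContextQuantifier(int sizeOfResult)
    {
        // "{{" becomes "{", "{0}" is replaced by sizeOfResult, "}}" becomes "}".
        return string.Format(@".{{0,{0}}}", sizeOfResult);
    }
}
```

ContextQuantifier(10) yields .{0,10}, which is the form the regex engine expects.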

The second regex, which wraps the key term in strong tags, is much simpler:

Regex.Replace(x.Value, 
            searchEntireWord ? 
                string.Format(@"(?i)(?<!\S){0}(?!\S)", Regex.Escape(searchTerm)) : 
                string.Format(@"(?i){0}", Regex.Escape(searchTerm)), 
            "<strong>$&</strong>")

Note that $& in the replacement pattern refers to the whole match value.
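
Here is a minimal sketch of that replacement step on its own, applied to the sentence from the question. The class name HighlightSketch and the method name Highlight are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HighlightSketch
{
    // Hypothetical helper: wraps each occurrence of searchTerm in <strong> tags.
    public static string Highlight(string snippet, string searchTerm, bool searchEntireWord)
    {
        var escaped = Regex.Escape(searchTerm);

        // Whole-word mode requires whitespace (or the string edge) on both sides.
        var pattern = searchEntireWord
            ? @"(?i)(?<!\S)" + escaped + @"(?!\S)"
            : @"(?i)" + escaped;

        // $& re-inserts the matched text, so the original casing is preserved.
        return Regex.Replace(snippet, pattern, "<strong>$&</strong>");
    }
}
```

For "All dogs go to dog heaven." and the term "dog", whole-word mode gives "All dogs go to <strong>dog</strong> heaven.", while partial mode gives "All <strong>dog</strong>s go to <strong>dog</strong> heaven.". Because the boundaries are whitespace-based, a term directly followed by punctuation, such as "dog.", is not treated as a whole word in this mode.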

C# code:

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static List<string> ExtractTexts(string text, string searchTerm, int sizeOfResult, bool searchEntireWord)
{
    // Partial-word mode: the term may be surrounded by other non-whitespace characters.
    var expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S)\S*{1}\S*(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm));
    if (searchEntireWord) {
        // Whole-word mode: the term must stand alone between whitespace boundaries.
        expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S){1}(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm));
    }
    return Regex.Matches(text, expression)
        .Cast<Match>()
        // Wrap the term inside each extracted snippet with <strong> tags.
        .Select(x => Regex.Replace(x.Value,
            searchEntireWord ?
                string.Format(@"(?i)(?<!\S){0}(?!\S)", Regex.Escape(searchTerm)) :
                string.Format(@"(?i){0}", Regex.Escape(searchTerm)),
            "<strong>$&</strong>"))
        .ToList();
}

Sample usage (see the demo):

var text = "The Dog is a real-pet animal. There's an undogging dog that only undogs non-dogs. It is one of the most obedient animals. There are many kinds of dogs in the world. Some of the are very friendly while some of them a dangerous. Dogs are of different color like black, red, white and brown. Some old them have slippery shiny skin and some have rough skin. Dogs are carnivorous animals. They like eating meat. They have four legs, two ears and a tail. Dogs are trained to perform different tasks. They protect us from thieves b) guarding our house. They are loving animals. A dog is called man's best friend. They are used by the police to find hidden things. They are one of the most useful animals in the world. Doggonit!";
var searchTerm = "dog";
var searchEntireWord = false;
Console.WriteLine("======= 10 ========");
var results = ExtractTexts(text, searchTerm, 10, searchEntireWord);
foreach (var result in results)
    Console.WriteLine(result);

Output (the line right after the header is the generated pattern, which the demo prints as well):

======= 10 ========
(?si)(?<!\S).{0,10}(?<!\S)\S*dog\S*(?!\S).{0,10}(?!\S)
The <strong>Dog</strong> is a
an un<strong>dog</strong>ging <strong>dog</strong> that
only un<strong>dog</strong>s non-<strong>dog</strong>s.
kinds of <strong>dog</strong>s in the
<strong>Dog</strong>s are of
skin. <strong>Dog</strong>s are
a tail. <strong>Dog</strong>s are
A <strong>dog</strong> is called
world. <strong>Dog</strong>gonit!

Another example:

Console.WriteLine("======= 15 ========");
results = ExtractTexts(text, searchTerm, 15, searchEntireWord);
foreach (var result in results)
    Console.WriteLine(result);

Output:

======= 15 ========
(?si)(?<!\S).{0,15}(?<!\S)\S*dog\S*(?!\S).{0,15}(?!\S)
The <strong>Dog</strong> is a real-pet
There's an un<strong>dog</strong>ging <strong>dog</strong> that only
un<strong>dog</strong>s non-<strong>dog</strong>s. It is one of
many kinds of <strong>dog</strong>s in the world.
a dangerous. <strong>Dog</strong>s are of
rough skin. <strong>Dog</strong>s are
and a tail. <strong>Dog</strong>s are trained to
animals. A <strong>dog</strong> is called
in the world. <strong>Dog</strong>gonit!

A simple solution using Regex.Replace:

public bool HighlightExactMatchOnly(string input, string textToHighlight, string expected)
{
    // given
    var escapedHighlight = Regex.Escape(textToHighlight);

    // when
    var result = Regex.Replace(input, @"\b" + escapedHighlight + @"\b", "<strong>$0</strong>");

    return expected == result;
}

Test:

var text = "My test dogs with a single dog and some text behind";
var expected = "My test dogs with a single <strong>dog</strong> and some text behind";
HighlightExactMatchOnly(text , "dog", expected);

Note that this is not the fastest solution.
