C# regex to find a phrase in HTML while excluding specific tags

Question

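I have HTML text and a specific phrase. I need to find the phrase in the HTML text and highlight it, but skip it when it is inside an h or a tag.

For example, this is my phrase: "phrase to highlight"

This is my HTML text: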

<p>Here starts text and here is phrase to highlight</p>
<a>here, phrase to highlight, supposed to be skipped</a>
<h3>here, phrase to highlight, supposed to be skipped</h3>
<div class="phrase to highlight">Here phrase to highlight must be highlighted again</div>

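The phrase should be highlighted inside the p and div tags, and skipped inside the a tag and any h tag.

I wrote a negative lookbehind to find my phrase and make sure it is not inside an HTML tag (for example, in an attribute value) or an HTML entity: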

var Pattern = $@"(?i)(?<!</?[^>]*|&[^;]*)(\bphrase to highlight\b)";

How can I modify my regex to exclude the a and h tags?

Answer 1

If the a and h tags contain no nested tags, use

(?<!</?[^>]*|&[^;]*|<(?:a|h\d)(?:\s[^>]*)?>[^<]*)(\bphrase to highlight\b)

See proof.

Explanation
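A minimal sketch of how the pattern above might be applied to the sample HTML from the question. The class name `PhraseHighlighter`, the method `Highlight`, and the `<mark>` wrapper are hypothetical choices for illustration. The `(?i)` flag is carried over from the question's pattern and is not part of the answer's regex.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhraseHighlighter
{
    // The answer's pattern in a verbatim string, so that \b reaches the regex
    // engine as a word boundary. (?i) is added here as an assumption, taken
    // from the question's own pattern.
    private const string Pattern =
        @"(?i)(?<!</?[^>]*|&[^;]*|<(?:a|h\d)(?:\s[^>]*)?>[^<]*)(\bphrase to highlight\b)";

    // Wraps every allowed occurrence in <mark>...</mark> (hypothetical markup).
    public static string Highlight(string html) =>
        Regex.Replace(html, Pattern, "<mark>$1</mark>");

    public static void Main()
    {
        string html =
            "<p>Here starts text and here is phrase to highlight</p>\n" +
            "<a>here, phrase to highlight, supposed to be skipped</a>\n" +
            "<h3>here, phrase to highlight, supposed to be skipped</h3>\n" +
            "<div class=\"phrase to highlight\">Here phrase to highlight must be highlighted again</div>";

        // Only the occurrences in the text of <p> and <div> are wrapped;
        // the ones in <a>, <h3>, and the class attribute are left untouched.
        Console.WriteLine(Highlight(html));
    }
}
```

With this input, the phrase is wrapped in the `p` text and the `div` text only. The `class` attribute value is skipped by the first lookbehind alternative, and the `a` and `h3` contents by the third.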

--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    <                        '<'
--------------------------------------------------------------------------------
    /?                       '/' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    [^>]*                    any character except: '>' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    &                        '&'
--------------------------------------------------------------------------------
    [^;]*                    any character except: ';' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    <                        '<'
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      a                        'a'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      h                        'h'
--------------------------------------------------------------------------------
      \d                       digits (0-9)
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      \s                       whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
      [^>]*                    any character except: '>' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
    >                        '>'
--------------------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
    phrase to                'phrase to highlight'
    highlight
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
  )                        end of \1
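The "no nested tags" condition stated with the pattern matters because the last lookbehind alternative ends in `[^<]*`, which cannot cross a child element. A small sketch of that limitation, with `NestedTagCheck` and `IsHighlighted` as hypothetical names introduced only for this illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedTagCheck
{
    // Same pattern as in the answer, written as a verbatim string.
    private const string Pattern =
        @"(?<!</?[^>]*|&[^;]*|<(?:a|h\d)(?:\s[^>]*)?>[^<]*)(\bphrase to highlight\b)";

    // True when the pattern would match (and therefore highlight) the phrase.
    public static bool IsHighlighted(string html) => Regex.IsMatch(html, Pattern);

    public static void Main()
    {
        // Plain text directly inside <a>: the lookbehind blocks the match.
        Console.WriteLine(IsHighlighted("<a>see phrase to highlight</a>"));              // False

        // A child element before the phrase: [^<]* cannot reach back to <a>,
        // so the phrase is matched even though it is still inside the link.
        Console.WriteLine(IsHighlighted("<a>see <b>this</b> phrase to highlight</a>"));  // True
    }
}
```

For markup where a or h elements can contain other elements, an HTML parser is likely a safer basis than a single regex.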

Solution 1
1 accepted 2020-11-02 20:29:03
