What regex will match text excluding what lies within HTML tags?

I am writing code for a search results page that needs to highlight search terms. The terms happen to occur within table cells (the app is iterating through GridView Row Cells), and these table cells may contain HTML.

Currently, my code looks like this (relevant hunks shown below):

const string highlightPattern = @"<span class=""Highlight"">$0</span>";
DataBoundLiteralControl litCustomerComments = (DataBoundLiteralControl)e.Row.Cells[CUSTOMERCOMMENTS_COLUMN].Controls[0];

// Turn "term1 term2" into "(term1|term2)"
string spaceDelimited = txtTextFilter.Text.Trim();
string pipeDelimited = string.Join("|", spaceDelimited.Split(new[] {" "}, StringSplitOptions.RemoveEmptyEntries));
string searchPattern = "(" + pipeDelimited + ")";

// Highlight search terms in Customer - Comments column
e.Row.Cells[CUSTOMERCOMMENTS_COLUMN].Text = Regex.Replace(litCustomerComments.Text, searchPattern, highlightPattern, RegexOptions.IgnoreCase);

Amazingly, it works. BUT, sometimes the text I am matching on is HTML that looks like this:

<span class="CustomerName">Fred</span> was a classy individual.

And if you search for "class", I want the highlight code to wrap the "class" in "classy", but of course not the HTML attribute "class" that happens to be in there! If you search for "Fred", that should be highlighted.

So what's a good regex that will make sure matches happen only OUTSIDE the HTML tags? It doesn't have to be super hardcore. Simply making sure the match is not between < and > would work fine, I think.

This regex should do the job: (?<!<[^>]*)(regex you want to check: Fred|span). The negative lookbehind checks that it is impossible to match the regex <[^>]* going backward from the start of the matching string, that is, that no unclosed < precedes the match.

Modified code below:

const string notInsideBracketsRegex = @"(?<!<[^>]*)";
const string highlightPattern = @"<span class=""Highlight"">$0</span>";
DataBoundLiteralControl litCustomerComments = (DataBoundLiteralControl)e.Row.Cells[CUSTOMERCOMMENTS_COLUMN].Controls[0];

// Turn "term1 term2" into "(term1|term2)"
string spaceDelimited = txtTextFilter.Text.Trim();
string pipeDelimited = string.Join("|", spaceDelimited.Split(new[] {" "}, StringSplitOptions.RemoveEmptyEntries));
string searchPattern = "(" + pipeDelimited + ")";
searchPattern = notInsideBracketsRegex + searchPattern;

// Highlight search terms in Customer - Comments column
e.Row.Cells[CUSTOMERCOMMENTS_COLUMN].Text = Regex.Replace(litCustomerComments.Text, searchPattern, highlightPattern, RegexOptions.IgnoreCase);
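For anyone who wants to try the lookbehind outside of a GridView, here is a minimal self-contained sketch of the same idea. The class and method names (LookbehindHighlighter, Highlight) are made up for illustration and are not part of the original code. One addition that the code above does not have: each term is passed through Regex.Escape, on the assumption that users type literal text rather than regex syntax.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original post.
public static class LookbehindHighlighter
{
    // Rejects a match when a '<' with no '>' after it precedes the match start.
    // Relies on .NET's support for variable-length lookbehind.
    private const string NotInsideTag = @"(?<!<[^>]*)";
    private const string Replacement = @"<span class=""Highlight"">$0</span>";

    public static string Highlight(string html, string spaceDelimitedTerms)
    {
        // Assumption: terms are literal text, so escape regex metacharacters.
        string[] terms = spaceDelimitedTerms
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(Regex.Escape)
            .ToArray();

        // Nothing to search for: return the input unchanged.
        if (terms.Length == 0) return html;

        // Turn "term1 term2" into "(term1|term2)" and prepend the lookbehind.
        string pattern = NotInsideTag + "(" + string.Join("|", terms) + ")";
        return Regex.Replace(html, pattern, Replacement, RegexOptions.IgnoreCase);
    }
}
```

Calling LookbehindHighlighter.Highlight on the sample string from the question with the term "class" wraps only the "class" in "classy" and leaves the attribute alone. Note that a stray literal < in the text (one with no > after it) would suppress every later match, so this only holds up if such characters are entity-encoded.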

You could use a regex with balancing groups and backreferences, but I strongly recommend using a parser here.

Hmm, I'm not a C# programmer, so I don't know the flavor of regex it uses, but (?!<.+?>) should ignore anything inside of tags. It will force you to use &#60; and &#62; for literal angle brackets in your HTML code, but you should be doing that anyway.
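A related lookahead that is often used for this, and which is a different pattern from the one quoted above, is term(?![^<>]*>): the match is rejected when the next angle bracket after it is a closing >. Here is a rough sketch, with a made-up class name (LookaheadHighlighter) and a single literal search term as simplifying assumptions. It depends on the same entity-encoding requirement mentioned above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original post.
public static class LookaheadHighlighter
{
    public static string Highlight(string html, string term)
    {
        // Reject the match if a '>' follows before any other '<' or '>',
        // which means the match position is inside a tag.
        // Assumption: literal '<' and '>' in the text are entity-encoded.
        string pattern = Regex.Escape(term) + @"(?![^<>]*>)";
        return Regex.Replace(
            html,
            pattern,
            @"<span class=""Highlight"">$0</span>",
            RegexOptions.IgnoreCase);
    }
}
```

On the sample string from the question, searching for "class" highlights only the one in "classy", and searching for "Fred" highlights the name inside the existing span. Unlike the lookbehind version, this form does not need variable-length lookbehind, so it should port to most regex flavors.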

Writing a regex that can handle CDATA sections is going to be hard. You may no longer assume that > closes a tag.

For instance, "<span class="CustomerName">Fred.</span> is a good customer (<![CDATA[ >10000$ ]]> )"

The solution is (as noted earlier) a parser. Parsers are much better at dealing with the kind of mess you find in a CDATA section. madgnome's backwards check cannot be used to find the starting <![CDATA from a ]]>, as a CDATA section may include the literal <![CDATA.
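To make the parser suggestion concrete, here is a sketch using System.Xml.Linq from the .NET base library. It rests on a big assumption: the cell content must be a well-formed XML fragment, which real-world HTML often is not, in which case a dedicated HTML parser would be needed instead. The class name (ParserHighlighter) and the single literal search term are also just for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper, not from the original post.
public static class ParserHighlighter
{
    // Assumption: 'fragment' is well-formed XML (closed tags, quoted attributes).
    public static string Highlight(string fragment, string term)
    {
        // Wrap in a dummy root so mixed text and elements parse as one document.
        XElement root = XElement.Parse(
            "<root>" + fragment + "</root>", LoadOptions.PreserveWhitespace);

        // The capturing group makes Regex.Split keep the matched text.
        var splitter = new Regex("(" + Regex.Escape(term) + ")", RegexOptions.IgnoreCase);

        // XCData derives from XText, so CDATA sections are visited as well.
        // ToList() takes a snapshot because the tree is modified in the loop.
        foreach (XText text in root.DescendantNodes().OfType<XText>().ToList())
        {
            string[] parts = splitter.Split(text.Value);
            if (parts.Length == 1) continue; // no match in this text node

            var replacement = new List<object>();
            for (int i = 0; i < parts.Length; i++)
            {
                if (parts[i].Length == 0) continue;
                // Odd indexes hold the matches, even indexes the text in between.
                replacement.Add(i % 2 == 1
                    ? (object)new XElement("span", new XAttribute("class", "Highlight"), parts[i])
                    : new XText(parts[i]));
            }
            text.ReplaceWith(replacement.ToArray());
        }

        // Serialize the children of the dummy root back to markup.
        return string.Concat(
            root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }
}
```

Because only text nodes are touched, attribute names and values such as class="CustomerName" can never be matched, and a > inside a CDATA section is not mistaken for the end of a tag. One side effect of this sketch: a CDATA section that contains a match is rewritten as ordinary escaped text around the highlight span.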
