Regex expression C# for HTML

I have the following regex:

^(<span style=.*?font-weight:bold.*?>.*?</span>)

It matches the following code:

<span style="font-family:Arial; font-size:10pt"> r.</span></p><p style="margin:0pt"><span style="font-family:Arial; font-size:10pt; font-weight:bold">&#xa0;</span>

But I would like to match only this part (the last span, which contains the font-weight:bold style):

<span style="font-family:Arial; font-size:10pt; font-weight:bold">&#xa0;</span>

Use HTML Agility Pack to parse the HTML:

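// Requires the HtmlAgilityPack NuGet package, plus:
// using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlContent);

// SelectNodes returns null when nothing matches, so fall back to an empty sequence.
var spans = doc.DocumentNode.SelectNodes("//span") ?? Enumerable.Empty<HtmlNode>();

// GetAttributeValue avoids a NullReferenceException on spans without a style attribute.
var boldSpans = from s in spans
                let style = s.GetAttributeValue("style", "")
                where style.Contains("font-weight:bold")
                select s;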

Or, even better, use an XPath expression that selects all matching nodes in one line:

doc.DocumentNode.SelectNodes("//span[contains(@style, 'font-weight:bold')]")

Don't use ^, since the line doesn't start with the span you want to match.

<span style=["'][^'"]*font-weight:bold[^'"]*['"]>[^<]*</span>

Or as an escaped string:

"<span style=[\"'][^'\"]*font-weight:bold[^'\"]*['\"]>[^<]*</span>"

This matches strings starting with <span style= followed by a single or double quote (' or "). Then [^'"]* allows any characters except the closing quote.

Next comes the literal string font-weight:bold, followed again by any number of characters except quotes, leading up to the actual closing quote and the end of the opening tag: [^'"]*['"]> .

(Note that you might or might not want to allow more attributes before and after the style attribute. In that case you need to alter the regex.)

The span may contain any number of characters except <, the start of a tag; the string then has to end with the closing </span> tag.
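As a rough illustration of the pattern described above, here is a minimal C# sketch that runs it against the sample from the question. The variable names and the `relaxed` variant (which tolerates other attributes around style) are assumptions made for this example, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input from the question: two spans, only the second one is bold.
string html =
    "<span style=\"font-family:Arial; font-size:10pt\"> r.</span></p>" +
    "<p style=\"margin:0pt\">" +
    "<span style=\"font-family:Arial; font-size:10pt; font-weight:bold\">&#xa0;</span>";

// The pattern from the answer: no ^ anchor, and negated character classes
// so a match cannot run past the closing quote or into another tag.
var strict = new Regex("<span style=[\"'][^'\"]*font-weight:bold[^'\"]*['\"]>[^<]*</span>");

// Hypothetical variant: [^>]* on both sides of style= allows other attributes.
var relaxed = new Regex(
    "<span\\b[^>]*\\bstyle=[\"'][^'\"]*font-weight:bold[^'\"]*['\"][^>]*>[^<]*</span>");

MatchCollection strictMatches = strict.Matches(html);
foreach (Match m in strictMatches)
{
    Console.WriteLine(m.Value); // prints only the bold span
}

// A span with extra attributes: the strict pattern expects style= right after <span.
string withClass = "<span class=\"x\" style=\"font-weight:bold\" id=\"y\">text</span>";
bool strictHit = strict.IsMatch(withClass);
bool relaxedHit = relaxed.IsMatch(withClass);
Console.WriteLine($"strict: {strictHit}, relaxed: {relaxedHit}");
```

Both patterns still assume the span contains no nested tags and that the style value is written exactly as font-weight:bold, without a space after the colon.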

Remove the ^, because it anchors the match to the beginning of the input (or of each line with RegexOptions.Multiline). Therefore the match will always start at the first span, all the more so because .* matches any characters other than a newline.

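After doing this, the first match may still be the output you have now, because .*? can run past the first span's closing > until it reaches font-weight:bold. If it is, restrict the wildcards (for example [^>]* inside the tag and [^<]* for the content) so that a match cannot cross a tag boundary.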
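To see what removing ^ actually changes, one can enumerate the matches on the sample from the question. In this sketch (variable names are illustrative, and the `bounded` pattern is an assumed variant rather than something from the answer), the lazy pattern yields a single match that still spans both spans, while the bounded pattern returns only the bold one.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input from the question (a single line, so . can match every character in it).
string html =
    "<span style=\"font-family:Arial; font-size:10pt\"> r.</span></p>" +
    "<p style=\"margin:0pt\">" +
    "<span style=\"font-family:Arial; font-size:10pt; font-weight:bold\">&#xa0;</span>";

// The question's pattern with only the ^ removed.
var lazy = new Regex("(<span style=.*?font-weight:bold.*?>.*?</span>)");
MatchCollection lazyMatches = lazy.Matches(html);
Console.WriteLine(lazyMatches.Count);            // 1
Console.WriteLine(lazyMatches[0].Value == html); // True: the match covers both spans

// Assumed variant: [^>]* keeps the style search inside one opening tag,
// and [^<]* keeps the content from running into the next tag.
var bounded = new Regex("<span style=[^>]*font-weight:bold[^>]*>[^<]*</span>");
Match hit = bounded.Match(html);
Console.WriteLine(hit.Value); // only the bold span
```

The lazy quantifier only makes the engine prefer the shortest continuation from a given starting position; it does not move the starting position forward, which is why the first span is still included.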

Furthermore, tools like RegexBuddy are good for testing regexes.
