Regular expression to match specific HTML tags

Question

I need to match HTML tags (the whole tag), based on the tag name.

For script tags I have this:

<script.+src=.+(\.js|\.axd).+(</script>|>)

It correctly matches both tags in the following HTML:

<script src="Scripts/JScript1.js" type="text/javascript" />
<script type="text/javascript" src="Scripts/JScript2.js" />

However, when I try link tags with the following:

<link.+href=.+(\.css).+(</link>|>)

It matches all of this at once (that is, it returns one match containing both items):

<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" />
<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />

What am I missing here? The regexes are essentially identical except for the text they match.

Also, I know that regex is not a great tool for HTML parsing... I will probably end up using the HtmlAgilityPack, but this is driving me nuts and I want an answer if only for my own mental health!

Answer 1

The .+ wildcards match anything, including the > that ends one tag and the start of the next tag. This:

<link.+href=.+(\.css).+(</link>|>)

Likely matches like this: 可能的匹配如下：

<link      => <link
.+         => href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" />
              <link 
 href=     => href=
 .+        => "Stylesheets/StyleSheet2
 \.css     => .css
 .+        => " rel="Stylesheet" type="text/css" /
 </link>|> => >
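The trace above can be checked with a small sketch. It uses Python's re module as a stand-in for whatever engine the asker runs, and it assumes both link tags sit on one line (by default . does not match a newline, so tags on separate lines would behave differently). The extra capture groups around each .+ are added here purely for inspection; they are not part of the original pattern.

```python
import re

# Both tags on one line: an assumed stand-in for the asker's HTML.
html = (
    '<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" /> '
    '<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />'
)

# The pattern from the question, with each .+ wrapped in a capture group
# so the text each wildcard consumed can be printed.
traced = re.compile(r'<link(.+)href=(.+)(\.css)(.+)(</link>|>)')

matches = list(traced.finditer(html))
first = matches[0]

labels = ("first .+", "second .+", r"\.css", "third .+", "closer")
for label, piece in zip(labels, first.groups()):
    print(f"{label:10} => {piece!r}")

print("number of matches:", len(matches))
```

The first .+ should come out holding the whole first tag plus the opening of the second one, which is why only a single match is reported.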

Instead, consider using [^>]+ in place of .+. Also, do you really care about the closing tag?

<link[^>]+href=[^>]+(\.css)[^>]+>
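As a quick check of the pattern above, here is a minimal sketch in Python's re module (an assumed stand-in for the asker's engine), run against the two link tags from the question placed on a single line.

```python
import re

# Both tags on one line: an assumed stand-in for the asker's HTML.
html = (
    '<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" /> '
    '<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />'
)

# [^>] cannot consume the '>' that ends a tag, so a match cannot
# spill over from one tag into the next.
per_tag = re.compile(r'<link[^>]+href=[^>]+(\.css)[^>]+>')

tags = [m.group(0) for m in per_tag.finditer(html)]
for tag in tags:
    print(tag)
```

Because the character class excludes >, each match is confined to a single tag regardless of whether the quantifiers are greedy.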

Answer 2

The problem is that your regex is greedy. Every .+ in it is greedy; make each one non-greedy by appending a ?, so that it matches as few characters as are needed to satisfy the pattern instead of running on to the next matching string.

Change the pattern to this: "<link.+?href=.+?(\\.css).+?(</link>|>)"

Then you'll need to use Regex.Matches to get multiple matches and loop over them.
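A minimal sketch of this approach, using Python's re.finditer in place of Regex.Matches (the asker's actual engine may differ, though lazy quantifiers behave the same way for this pattern). The second input, with the file names favicon.ico and site.css, is hypothetical and is included only to show one caveat: a lazy .+? can still cross a tag boundary when the first tag has no .css in it, whereas the [^>]+ pattern cannot.

```python
import re

# Both tags on one line: an assumed stand-in for the asker's HTML.
html = (
    '<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" /> '
    '<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />'
)

# Non-greedy version of the asker's pattern.
lazy = re.compile(r'<link.+?href=.+?(\.css).+?(</link>|>)')

# Loop over every match, one per tag.
lazy_tags = [m.group(0) for m in lazy.finditer(html)]
for tag in lazy_tags:
    print(tag)

# Hypothetical input: the first link tag does not reference a .css file.
mixed = '<link href="favicon.ico" rel="icon" /> <link href="site.css" rel="Stylesheet" />'

# The lazy pattern starts at the first <link and keeps expanding until it
# finds ".css", which only appears in the second tag.
lazy_mixed = [m.group(0) for m in lazy.finditer(mixed)]

# The negated-class pattern is bounded by '>' and skips the first tag.
bounded = re.compile(r'<link[^>]+href=[^>]+(\.css)[^>]+>')
bounded_mixed = [m.group(0) for m in bounded.finditer(mixed)]

print(lazy_mixed)
print(bounded_mixed)
```

On the question's own input both approaches give one match per tag; the difference only shows up on inputs like the hypothetical one.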

Answer 1: 2, accepted, 2011-01-16 18:51:30

Answer 2: 1, 2011-01-16 18:50:48
