
Regex to match specific HTML tags

I need to match HTML tags (the whole tag), based on the tag name.

For script tags I have this:

<script.+src=.+(\.js|\.axd).+(</script>|>)

It correctly matches both tags in the following HTML:

<script src="Scripts/JScript1.js" type="text/javascript" />
<script type="text/javascript" src="Scripts/JScript2.js" />

However, when I try link tags with the following:

<link.+href=.+(\.css).+(</link>|>)

It matches all of this at once (i.e. it returns one match containing both items):

<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" />
<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />

What am I missing here? The regexes are essentially identical except for the text to match.

Also, I know that regex is not a great tool for HTML parsing... I will probably end up using the HtmlAgilityPack, but this is driving me nuts and I want an answer if only for my own mental health!

The .+ wildcards match anything, including the > that ends a tag, and they grab as much as they can. This:

<link.+href=.+(\.css).+(</link>|>)

Likely matches like this:

<link      => <link
.+         =>  href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" />
              <link 
href=      => href=
.+         => "Stylesheets/StyleSheet2
\.css      => .css
.+         => " rel="Stylesheet" type="text/css" /
</link>|>  => >
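
The breakdown above can be reproduced with a short script. This is only a sketch: it uses Python's re module as a stand-in for the .NET engine (greedy .+ behaves the same way for this pattern), and it assumes both tags sit on one line, since by default . does not cross a newline. The variable names are illustrative.

```python
import re

# Assumption: both tags are on a single line, so `.` can run from one into the other.
html = (
    '<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" /> '
    '<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />'
)

# The pattern from the question, unchanged.
greedy = re.compile(r'<link.+href=.+(\.css).+(</link>|>)')

greedy_matches = [m.group(0) for m in greedy.finditer(html)]

# One match that swallows both tags.
print(len(greedy_matches))
print(greedy_matches[0] == html)
```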

Instead, consider using [^>]+ in place of .+ so that no wildcard can run past the end of a tag. Also, do you really care about the closing tag?

<link[^>]+href=[^>]+(\.css)[^>]+>
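
As a quick check of the [^>]+ version, here is a sketch in Python's re module (used as a stand-in for .NET; the character-class syntax is the same). The sample input and variable names are assumptions made for illustration.

```python
import re

# Assumption: both tags are on a single line, the hardest case for the original pattern.
html = (
    '<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" /> '
    '<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />'
)

# [^>]+ cannot consume a '>', so each match has to end at its own tag's '>'.
bounded = re.compile(r'<link[^>]+href=[^>]+(\.css)[^>]+>')

tags = [m.group(0) for m in bounded.finditer(html)]

for tag in tags:
    print(tag)
```

Because [^>] also matches a newline, the same pattern works when the tags are on separate lines.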

The problem is that your regex is greedy. Every .+ grabs as many characters as it can. You need to make each one non-greedy (lazy) by appending a ? to it, so that it matches as few characters as are needed to satisfy the pattern instead of running on to the next matching string.

Change the pattern to this: "<link.+?href=.+?(\\.css).+?(</link>|>)"

Then you'll need to use Regex.Matches to get multiple matches and loop over them.
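
A minimal sketch of the lazy pattern and the loop over its matches, written with Python's re module as a stand-in: re.finditer plays the role that Regex.Matches plays in .NET. The sample input and variable names are assumptions made for illustration.

```python
import re

# Assumption: both tags are on a single line.
html = (
    '<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" /> '
    '<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />'
)

# Each .+? stops at the first position where the rest of the pattern can match.
lazy = re.compile(r'<link.+?href=.+?(\.css).+?(</link>|>)')

lazy_matches = []
for match in lazy.finditer(html):
    lazy_matches.append(match.group(0))
    print(match.group(0))
```

One caveat: a lazy .+? can still cross a > when it has to. If a link tag with no href is followed by one that has an href, the match starts in the first tag and ends in the second.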
