Regex to match specific HTML tags
I need to match HTML tags (the whole tag), based on the tag name.
For script tags I have this:
<script.+src=.+(\.js|\.axd).+(</script>|>)
It correctly matches both tags in the following HTML:
<script src="Scripts/JScript1.js" type="text/javascript" />
<script type="text/javascript" src="Scripts/JScript2.js" />
However, when I do the same for link tags with the following:
<link.+href=.+(\.css).+(</link>|>)
It matches all of this at once (i.e. it returns one match containing both items):
<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" />
<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />
What am I missing here? The regexes are essentially identical except for the text to match.
Also, I know that regex is not a great tool for HTML parsing... I will probably end up using the HtmlAgilityPack in the end, but this is driving me nuts and I want an answer if only for my own mental health!
The .+ wildcards match anything. This:
<link.+href=.+(\.css).+(</link>|>)
Likely matches like this:
<link => <link
.+ => href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" />
<link
href= => href=
.+ => "Stylesheets/StyleSheet2
\.css => .css
.+ => " rel="Stylesheet" type="text/css" /
</link>|> => >
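The trace above can be reproduced with a short script. This is only a sketch: it uses Python's `re` module as a stand-in for the .NET engine (the two behave the same for this pattern), and it assumes both tags sit on one line, since `.` does not cross a newline by default in either engine.

```python
import re

# Assumption: both tags are on the same line, so `.` can run from one into the other.
html = (
    '<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" />'
    '<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />'
)

# The pattern from the question, unchanged.
greedy = re.compile(r'<link.+href=.+(\.css).+(</link>|>)')

greedy_matches = [m.group(0) for m in greedy.finditer(html)]

# Each greedy `.+` grabs as much as it can and only backtracks as far as needed,
# so the single match starts at the first <link and ends at the last >.
print(len(greedy_matches))
print(greedy_matches[0] == html)
```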
Instead, consider using [^>]+ in place of .+. Also, do you really care about the closing tag?
<link[^>]+href=[^>]+(\.css)[^>]+>
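As a quick check of this pattern, here is a minimal sketch in Python (again a stand-in for .NET; the sample string is the HTML from the question joined onto one line). Because `[^>]+` cannot consume a `>`, each match is confined to a single tag. One caveat: an attribute value that itself contains `>` would still end the match early.

```python
import re

# Assumption: same sample input as in the question, with both tags on one line.
html = (
    '<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" />'
    '<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />'
)

# [^>]+ matches any run of characters except '>', so it stops at the tag boundary.
tag_scoped = re.compile(r'<link[^>]+href=[^>]+(\.css)[^>]+>')

tag_matches = [m.group(0) for m in tag_scoped.finditer(html)]

for tag in tag_matches:
    print(tag)
```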
The problem is that your regex is greedy. Every .+ in the pattern is greedy: it consumes as many characters as it can. You need to make each one non-greedy by appending a ? to it, which makes it match as few characters as are needed to satisfy the pattern, so it does not run on into the next matching string.
Change the pattern to this:
"<link.+?href=.+?(\\.css).+?(</link>|>)"
Then you'll need to use Regex.Matches to get multiple matches and loop over them.
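A rough sketch of the lazy pattern and the loop, written in Python rather than C#: `re.finditer` plays the role of `Regex.Matches` here, and the raw string needs only a single backslash where the C# string literal above uses `\\`. The sample input is assumed to be the two tags from the question on one line.

```python
import re

# Assumption: the two <link> tags from the question, on a single line.
html = (
    '<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" />'
    '<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />'
)

# Each .+? expands one character at a time and stops at the first point
# where the rest of the pattern can match.
lazy = re.compile(r'<link.+?href=.+?(\.css).+?(</link>|>)')

lazy_matches = []
for m in lazy.finditer(html):  # analogous to looping over Regex.Matches(...)
    lazy_matches.append(m.group(0))
    print(m.start(), m.group(0))
```

Note that lazy quantifiers only prefer the shortest match; if a tag had no `.css` in it, `.+?` could still stretch into the following tag to find one, which the `[^>]+` form avoids.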