Regex to get url from HTML
I'm using the following Regex (which I found online) to obtain the URLs within an HTML page:
Regex regex = new Regex(@"url\((?<char>['""])?(?<url>.*?)\k<char>?\)");
It works fine for the HTML below:
<div style="background:url(images/logo.png) no-repeat;">UK</div>
However, it returns more than I need when the HTML page contains the following JavaScript; in that case it also returns 'destpage':
function buildurl(destpage)
I tried the following regex to include a colon, but it appears to be invalid:
:url\((?<char>['""])?(?<:url>.*?)\k<char>?\)
Any help would be much appreciated.
To get all the URLs, use the HtmlAgilityPack instead of a Regex. From their example page:
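HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// SelectNodes returns null when nothing matches, so check for null in real code.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    // Read the link target; the second argument is the default when href is missing.
    string href = link.GetAttributeValue("href", "");
}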
You can expand on that to obtain your style URLs by, for example, using //@style to get the style nodes and iterating through those to extract the url value.
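As a rough sketch of that idea (not code from the HtmlAgilityPack example page): the snippet below loads the markup, walks every element that has a style attribute, and runs the original url(...) pattern over the attribute text only. The class name StyleUrlExtractor and the method Extract are made up for illustration, and the XPath //*[@style] is an assumption on my part: it selects the elements carrying the attribute rather than the attribute nodes themselves, since the HtmlNode API works on elements.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // third-party NuGet package, must be installed separately

public static class StyleUrlExtractor // hypothetical helper, not part of HtmlAgilityPack
{
    // Same pattern as in the question; it is only applied to style attribute values.
    private static readonly Regex CssUrl =
        new Regex(@"url\((?<char>['""])?(?<url>.*?)\k<char>?\)");

    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: pick elements that have a style attribute, then read its value.
        HtmlNodeCollection styled = doc.DocumentNode.SelectNodes("//*[@style]");
        if (styled == null)
        {
            return urls; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode node in styled)
        {
            string style = node.GetAttributeValue("style", "");
            foreach (Match m in CssUrl.Matches(style))
            {
                urls.Add(m.Groups["url"].Value);
            }
        }
        return urls;
    }
}
```

Because the regex never sees script text here, a line such as function buildurl(destpage) inside a script block should not produce a match.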
Only add the colon to the front:
:url\((?<char>['""])?(?<url>.*?)\k<char>?\)
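The second "url" is the name of that group, so it has to stay a plain identifier. A colon is not allowed in a group name, which is why (?<:url>...) was rejected as invalid.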
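To check the behaviour, here is a small test harness, offered as a sketch rather than a definitive implementation. It runs the colon-prefixed pattern against the two inputs from the question; the class name CssUrlDemo and the method ExtractUrls are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CssUrlDemo // hypothetical name, for illustration only
{
    // A literal ':' must come directly before "url(".
    // The named group is still called "url"; only the text in front of it changed.
    private static readonly Regex ColonUrl =
        new Regex(@":url\((?<char>['""])?(?<url>.*?)\k<char>?\)");

    // Returns the value of the "url" group for every match in the input.
    public static List<string> ExtractUrls(string input)
    {
        var urls = new List<string>();
        foreach (Match m in ColonUrl.Matches(input))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        string css = "<div style=\"background:url(images/logo.png) no-repeat;\">UK</div>";
        string js = "function buildurl(destpage)";

        Console.WriteLine(string.Join(", ", ExtractUrls(css))); // prints images/logo.png
        Console.WriteLine(ExtractUrls(js).Count);               // prints 0
    }
}
```

Note that this only matches when the colon sits immediately before url, so a declaration written as background: url(...) with a space would be missed. Allowing optional whitespace with :\s*url is one possible adjustment, though that goes beyond what the answer proposes.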