Regex to get URLs from HTML

Question

I'm using the following regex (which I found online) to obtain the URLs within an HTML page:

        Regex regex = new Regex(@"url\((?<char>['""])?(?<url>.*?)\k<char>?\)");

It works fine for the HTML below:

<div style="background:url(images/logo.png) no-repeat;">UK</div>

However, it returns more than I need when the HTML page contains the following JavaScript. The pattern also matches the "url(" at the end of "buildurl(", so it returns 'destpage':

function buildurl(destpage)

I tried the following regex, which adds a colon, but it appears to be invalid:

:url\((?<char>['""])?(?<:url>.*?)\k<char>?\)

Any help would be much appreciated.

Answer 1

To get all the URLs, use the HtmlAgilityPack instead of a regex. From their example page:

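// Requires: using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Current HtmlAgilityPack releases expose the root as DocumentNode; SelectNodes returns null when nothing matches.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        string href = link.GetAttributeValue("href", ""); // the link's URL
    }
}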

You can expand on that to obtain your style URLs by, for example, using //@style to get the style nodes and iterating through those to extract the url value.
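A minimal sketch of that idea, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (StyleUrlExtractor, Extract) are hypothetical. It selects elements with the XPath //*[@style], a variant of the //@style suggestion, and reuses the question's regex on each style attribute value. Because only style attributes are scanned, script text such as buildurl(destpage) is never examined.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class StyleUrlExtractor
{
    // The pattern from the question, applied only to style attribute values.
    private static readonly Regex CssUrl = new Regex(
        @"url\((?<char>['""])?(?<url>.*?)\k<char>?\)");

    // Hypothetical helper: returns every url(...) value found in inline styles.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var urls = new List<string>();

        // Select every element that carries a style attribute.
        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//*[@style]");
        if (nodes == null)
        {
            return urls;
        }

        foreach (HtmlNode node in nodes)
        {
            string style = node.GetAttributeValue("style", "");
            foreach (Match m in CssUrl.Matches(style))
            {
                urls.Add(m.Groups["url"].Value);
            }
        }
        return urls;
    }
}
```

Calling StyleUrlExtractor.Extract with the page source returns the URLs from inline styles only. URLs inside style elements or external stylesheets would need separate handling.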

Answer 2

Add the colon only at the front of the pattern:

:url\((?<char>['""])?(?<url>.*?)\k<char>?\)

The second "url" is the name of that capture group, so leave it unchanged. A colon is not valid in a group name, which is why (?<:url>...) is rejected.
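As a rough illustration, the sketch below wraps the colon-prefixed pattern in a small helper. The class and method names (CssUrlFinder, Find) are hypothetical, and the \s* after the colon is an added assumption so that "background: url(...)" with a space also matches; it is not part of the answer's pattern.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CssUrlFinder
{
    // The answer's pattern with a leading colon. The \s* is an assumption
    // added here to tolerate whitespace between the colon and "url(".
    private static readonly Regex CssUrl = new Regex(
        @":\s*url\((?<char>['""])?(?<url>.*?)\k<char>?\)");

    // Hypothetical helper: returns the contents of the named group "url"
    // for every match in the input text.
    public static List<string> Find(string html)
    {
        var urls = new List<string>();
        foreach (Match m in CssUrl.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

With this pattern, the div from the question yields images/logo.png, while function buildurl(destpage) yields nothing because no colon precedes "url(". A url(...) that follows something other than a colon, for instance in a comma-separated background list, would be missed.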

2 answers

Answer 1
3 2013-08-28 15:01:10

Answer 2
0 2013-08-28 15:10:09
