Regex in C#: extracting the URL from an <a> tag

I am trying to extract the URL from an <a> tag. However, instead of getting https://website.com/-id1, I am getting the tag's link text. Here is my code:

string text="<a style=\"font - weight: bold; \" href=\"https://website.com/-id1\">MyLink</a>";
string parsed = Regex.Replace(text, " <[^>] + href =\"([^\"]+)\"[^>]*>", "$1 " );
parsed = Regex.Replace(parsed, "<[^>]+>", "");
Console.WriteLine(parsed);

The result I got was MyLink, which is not what I want. I want something like:

https://website.com/-id1

Any help or a link will be highly appreciated.

Regular expressions can be used with HTML only in very specific, simple cases. For example, if the text contains only a single tag, you can use "href\\s*=\\s*\"(?<url>.*?)\"" to extract the URL, e.g.:

var url=Regex.Match(text,"href\\s*=\\s*\"(?<url>.*?)\"").Groups["url"].Value;

This pattern will return:

https://website.com/-id1

This regex doesn't do anything fancy. It looks for href= with optional whitespace around the equals sign, then captures everything between the first double quote and the next one in a non-greedy manner ( .*? ). The result is captured in the named group url.

Anything fancier gets complex very quickly. For example, supporting both single and double quotes requires special handling to avoid starting on a single quote and ending on a double quote. The string could also contain multiple <a> tags that use both types of quotes.
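As a minimal sketch of the "special handling" mentioned above, one option is to capture the opening quote in a named group and require the same character to close the value with a backreference. The class name HrefExtractor, the method FirstHref, and the group name q are made up for this illustration; only the url group comes from the pattern above.

```csharp
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // (?<q>["']) remembers which quote opened the value;
    // \k<q> only accepts that same quote character as the closing one.
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*(?<q>[""'])(?<url>.*?)\k<q>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the first href value, or an empty string if none is found.
    public static string FirstHref(string html)
    {
        Match m = HrefPattern.Match(html);
        return m.Success ? m.Groups["url"].Value : string.Empty;
    }
}
```

Calling HrefExtractor.FirstHref(text) with the string from the question should give the same URL as the simpler pattern, and a value such as href="it's" is still read up to the closing double quote. Unquoted attribute values are still not covered, which is part of why this approach does not scale.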

For complex parsing, it would be better to use a library like AngleSharp or HtmlAgilityPack.
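For comparison, here is a rough sketch of the library route using HtmlAgilityPack. It assumes the HtmlAgilityPack NuGet package is referenced; the class name LinkParser and the method ExtractHrefs are invented for this example.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkParser
{
    // Hypothetical helper: returns the href of every <a> element that has one.
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var urls = new List<string>();

        // The XPath selects only anchors that carry an href attribute.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return urls;
        }

        foreach (HtmlNode anchor in anchors)
        {
            urls.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return urls;
    }
}
```

Because the parser handles quoting, attribute order, and multiple tags, the calling code does not need to change when the markup gets messier.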

Try this:

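using System.Collections.Generic;
using System.Text.RegularExpressions;

var input = "<a style=\"font - weight: bold; \" href=\"https://website.com/-id1\">MyLink</a><a style=\"font - weight: bold; \" href=\"https://website.com/-id2\">MyLink2</a>";
// Lazily match each <a> tag and capture the double-quoted href value in group 1.
var r = new Regex("<a.*?href=\"(.*?)\".*?>");
var output = r.Matches(input);
var urls = new List<string>();
// Typing the loop variable as Match avoids an "as" cast on every item.
foreach (Match item in output) {
    urls.Add(item.Groups[1].Value);
}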

It will find all <a> tags, extract their href values, and store them in the urls list.

Explanation

<a matches the beginning of an <a> tag
.*?href= matches anything up to href=
"(.*?)" matches and captures everything between the double quotes
.*?> matches the rest of the tag up to its closing >
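The same pattern can also be written with the explanation embedded in it. The sketch below uses RegexOptions.IgnorePatternWhitespace, which ignores unescaped whitespace in the pattern and treats # as the start of a comment; the class name AnnotatedHrefRegex and the method ExtractAll are made up for this illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnnotatedHrefRegex
{
    // Same pattern as above, split over several lines with inline comments.
    // In a verbatim string, "" stands for a single double quote.
    private static readonly Regex AnchorHref = new Regex(@"
        <a          # beginning of an <a> tag
        .*?href=    # anything, lazily, up to href=
        ""(.*?)""   # capture everything between the double quotes
        .*?>        # the rest of the tag up to its closing bracket
        ", RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: collects group 1 of every match.
    public static List<string> ExtractAll(string input)
    {
        var urls = new List<string>();
        foreach (Match m in AnchorHref.Matches(input))
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

Passing the two-link input string from the answer to AnnotatedHrefRegex.ExtractAll should yield both URLs in document order.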
