
Regex to Parse Hyperlinks and Descriptions

C#: What is a good Regex to parse hyperlinks and their description?

Please consider case insensitivity, whitespace, and the use of single quotes (instead of double quotes) around the HREF attribute value.

Please also consider obtaining hyperlinks which have other tags within the <a> tags, such as <b> and <i>.

As long as there are no nested tags (and no line breaks), the following variant works well:

<a\s+href=(?:"([^"]+)"|'([^']+)').*?>(.*?)</a>
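A minimal C# sketch of how this pattern might be applied, assuming RegexOptions.IgnoreCase is added to cover the case-insensitivity requirement. The names SimpleLinkDemo and Extract are made up for illustration. Because the URL lands in group 1 for double quotes and in group 2 for single quotes, the code checks which of the two succeeded.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SimpleLinkDemo
{
    // Pattern from the answer above; "" is an escaped double quote
    // inside a C# verbatim string.
    private static readonly Regex Link = new Regex(
        @"<a\s+href=(?:""([^""]+)""|'([^']+)').*?>(.*?)</a>",
        RegexOptions.IgnoreCase);

    // Returns (url, text) pairs for links without nested <a> tags
    // that sit on a single line.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Link.Matches(html))
        {
            // Group 1 holds a double-quoted URL, group 2 a single-quoted one.
            string url = m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
            result.Add(new KeyValuePair<string, string>(url, m.Groups[3].Value));
        }
        return result;
    }
}
```

Note that the pattern expects href to be the first attribute after `<a`, so a link such as `<a class="x" href="...">` would be skipped.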

As soon as nested tags come into play, regular expressions are unfit for parsing. However, you can still use them by applying more advanced features of modern interpreters (depending on your regex engine). E.g., .NET regular expressions support balancing groups, which maintain a stack; I found this:

(?:<a.*?href=[""'](?<url>.*?)[""'].*?>)(?<name>(?><a[^<]*>(?<DEPTH>)|</a>(?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:</a>) 

Source: http://weblogs.asp.net/scottcate/archive/2004/12/13/281955.aspx
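To see what the stack is doing, here is a small C# sketch that wraps the quoted pattern unchanged. The class and method names (BalancedLinkDemo, First) are hypothetical, and RegexOptions.IgnoreCase is an added assumption. Each nested opening `<a ...>` pushes onto DEPTH, each `</a>` pops it, and `(?(DEPTH)(?!))` forces the match to fail while the stack is non-empty.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedLinkDemo
{
    // The pattern quoted above, split across lines for readability only.
    private static readonly Regex Link = new Regex(
        @"(?:<a.*?href=[""'](?<url>.*?)[""'].*?>)" +
        @"(?<name>(?><a[^<]*>(?<DEPTH>)|</a>(?<-DEPTH>)|.)+)" +
        @"(?(DEPTH)(?!))(?:</a>)",
        RegexOptions.IgnoreCase);

    // Returns the first link match; read Groups["url"] and Groups["name"].
    public static Match First(string html)
    {
        return Link.Match(html);
    }
}
```

For example, `BalancedLinkDemo.First("<a href='page.html'><b>Bold</b> and <i>italic</i></a>")` should yield the URL in `Groups["url"]` and the inner markup, including the `<b>` and `<i>` tags, in `Groups["name"]`. Since `.` does not match newlines by default, links spanning several lines would additionally need RegexOptions.Singleline.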

See this example from StackOverflow: Regular expression for parsing links from a webpage?

Using the HTML Agility Pack, you can parse the HTML and extract details using the semantics of the HTML, instead of a broken regex.
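A rough sketch of the parser-based approach, assuming the HtmlAgilityPack NuGet package is referenced. AgilityLinkDemo and Extract are illustrative names, not part of the library. The XPath `//a[@href]` selects every anchor that carries an href attribute, regardless of quoting style or nested tags.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // external NuGet package, assumed to be installed

public static class AgilityLinkDemo
{
    // Returns (url, text) pairs for every <a href=...> in the document.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<KeyValuePair<string, string>>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return result;

        foreach (HtmlNode a in anchors)
        {
            string url = a.GetAttributeValue("href", string.Empty);
            // InnerText drops nested markup such as <b> and <i>.
            result.Add(new KeyValuePair<string, string>(url, a.InnerText.Trim()));
        }
        return result;
    }
}
```

If the nested markup itself is needed as the description, `a.InnerHtml` could be used in place of `a.InnerText`.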

I found this, but apparently these guys had some problems with it.

Edit: (It works!)
I have now done my own testing and found that it works. I don't know C#, so I can't give you a C# answer, but I do know PHP, and here is the matches array I got back from running it on this:

<a href="pages/index.php" title="the title">Text</a>

array(3) { [0]=> string(52) "<a href="pages/index.php" title="the title">Text</a>" [1]=> string(15) "pages/index.php" [2]=> string(4) "Text" }

I have a regex that handles most cases, though I believe it does match HTML within a multiline comment.

It's written using the .NET syntax, but should be easily translatable.

I'm just going to throw this snippet out there now that I have it working. This is a less greedy version of one suggested earlier. The original wouldn't work if the input had multiple hyperlinks. The code below lets you loop through all the hyperlinks:

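using System.Text.RegularExpressions;

// Negated character classes keep each match inside a single <a> element.
static Regex rHref = new Regex(@"<a\s[^>]*?href\s*=\s*[""'](?<url>[^""']+)[""'][^>]*>(?<keywords>[^<]+)</a>", RegexOptions.IgnoreCase | RegexOptions.Compiled);

public void ParseHyperlinks(string html)
{
   MatchCollection mcHref = rHref.Matches(html);

   // AddKeywordLink is the author's own helper; it is not shown in the post.
   foreach (Match m in mcHref)
      AddKeywordLink(m.Groups["keywords"].Value, m.Groups["url"].Value);
}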

Here is a regular expression that will match the balanced tags.

(?:<a.*?href=[""'](?<url>.*?)[""'].*?>)(?<name>(?><a[^<]*>(?<DEPTH>)|</a>(?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:</a>)
