Regular expression for finding the 'href' value of an <a> link

I need a regex pattern for finding web page links in HTML.

I first use @"(<a.*?>.*?</a>)" to extract the links (<a>), but I can't get the href from that.

My strings are:

  1. <a href="www.example.com/page.php?id=xxxx&name=yyyy"....></a>
  2. <a href="http://www.example.com/page.php?id=xxxx&name=yyyy"....></a>
  3. <a href="https://www.example.com/page.php?id=xxxx&name=yyyy"....></a>
  4. <a href="www.example.com/page.php/404"....></a>

1, 2 and 3 are valid and I need them, but number 4 is not valid for me (? and = are essential).


Thanks, everyone, but I don't need to parse <a> tags. I have a list of links in href="abcdef" format.

I need to fetch the href of each link and filter it: the URLs I want must contain ? and =, like page.php?id=5.

Thanks!

I'd recommend using an HTML parser over a regex, but here is a regex that creates a capturing group over the value of the href attribute of each link. It matches whether double or single quotes are used.

<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1

You can view a full explanation of this regex here.
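To illustrate how the capturing groups of this pattern are read from C#, here is a minimal sketch. The names `HrefExtractor` and `ExtractHrefs` are made up for this example; the pattern itself is unchanged, so group 1 holds the opening quote and group 2 holds the href value.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Same pattern as above, written as a verbatim string ("" is an escaped quote).
    // Group 1 = opening quote character, group 2 = href value.
    private static readonly Regex LinkRx = new Regex(
        @"<a\s+(?:[^>]*?\s+)?href=([""'])(.*?)\1");

    // Hypothetical helper: returns every href value found in the given HTML text.
    public static List<string> ExtractHrefs(string html)
    {
        var result = new List<string>();
        foreach (Match m in LinkRx.Matches(html))
        {
            result.Add(m.Groups[2].Value);
        }
        return result;
    }
}
```

Calling `HrefExtractor.ExtractHrefs(html)` on a page's source should return the href values in document order, whichever quote style each link uses.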

Snippet playground:

 const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/;
 const textToMatchInput = document.querySelector('[name=textToMatch]');
 document.querySelector('button').addEventListener('click', () => {
   console.log(textToMatchInput.value.match(linkRx));
 });

 <label>
   Text to match:
   <input type="text" name="textToMatch" value='<a href="google.com"'>
   <button>Match</button>
 </label>

Using regex to parse HTML is not recommended.

Regex is meant for regularly occurring patterns. HTML is not regular in its format (except XHTML). For example, HTML files are valid even if you don't have a closing tag! This could break your code.

Use an HTML parser like HtmlAgilityPack.

You can use this code to retrieve all hrefs in anchor tags using HtmlAgilityPack:

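using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// SelectNodes returns null when no <a> element is found, so guard against that in real code.
var hrefList = doc.DocumentNode.SelectNodes("//a")
                  .Select(p => p.GetAttributeValue("href", "not found"))
                  .ToList();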

hrefList contains all the href values.

Thanks, everyone (especially @plalx).

I find it quite overkill to enforce the validity of the href attribute with such a complex and cryptic pattern, when a simple expression such as
<a\s+(?:[^>]*?\s+)?href="([^"]*)"
would suffice to capture all URLs. If you want to make sure they contain at least a query string, you could just use
<a\s+(?:[^>]*?\s+)?href="([^"]+\?[^"]+)"
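As a rough illustration of the second pattern, the sketch below applies it in C#. The names `QueryLinkFinder` and `FindQueryLinks` are invented for this example. Note that the pattern only requires a literal `?` with at least one character on each side; it does not check for `=`.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QueryLinkFinder
{
    // The query-string pattern quoted above, as a verbatim string.
    // Group 1 = href value, which must contain a literal '?'.
    private static readonly Regex QueryLinkRx = new Regex(
        @"<a\s+(?:[^>]*?\s+)?href=""([^""]+\?[^""]+)""");

    // Hypothetical helper: returns only the hrefs that carry a query string.
    public static List<string> FindQueryLinks(string html)
    {
        var result = new List<string>();
        foreach (Match m in QueryLinkRx.Matches(html))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

With links shaped like the ones in the question, the `page.php?id=xxxx&name=yyyy` links should be returned and the `page.php/404` link skipped.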


My final regex string:

First, use one of these:
 st = @"((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+ \w\d:#@%/;$()~_?\+-=\\\.&]*)";
 st = @"<a href[^>]*>(.*?)</a>";
 st = @"((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)";
 st = @"((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:#@%/;$()~_?\+,\-=\\.&]+)";
 st = @"(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)";
 st = @"(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)|(www\.)[\w\d:#@%/;$()~_?\+-=\\\.&]*)";
 st = @"href=[""'](?<url>(http|https)://[^/]*?\.(com|org|net|gov))(/.*)?[""']";
 st = @"(<a.*?>.*?</a>)";
 st = @"(?:hrefs*=)(?:[s""']*)(?!#|mailto|location.|javascript|.*css|.*this.)(?.*?)(?:[s>""'])";
 st = @"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&amp;\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
 st = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
 st = @"(http|https)://([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&amp;\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
 st = @"((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?)";
 st = @"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&amp;\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
 st = @"http(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&amp;%\$#_]*)?$";
 st = @"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*";

My choice is

@"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*"

Second, use this:

 st = "(.*)?(.*)=(.*)";
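One caveat about this second pattern: the `?` in `(.*)?(.*)=(.*)` is unescaped, so it acts as a quantifier that makes the first group optional, and the pattern matches any string containing `=`. If the intent is to keep only URLs where a literal `?` is later followed by `=`, a sketch along these lines may be closer. The names `QueryFilter` and `KeepQueryUrls` are hypothetical, and the exact rule is an assumption about what the filter should accept.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class QueryFilter
{
    // Assumption: a URL is wanted when a literal '?' is followed, later on, by '='.
    // The backslash makes '?' a literal character instead of a quantifier.
    private static readonly Regex HasQueryRx = new Regex(@"\?[^=]*=");

    // Hypothetical helper: keeps only the URLs that satisfy the rule above.
    public static List<string> KeepQueryUrls(IEnumerable<string> urls)
    {
        return urls.Where(u => HasQueryRx.IsMatch(u)).ToList();
    }
}
```

Under this rule `page.php?id=5` is kept, while `page.php/404` and a string such as `a=b?c` (where `=` comes before `?`) are dropped.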


Problem solved. Thanks, everyone :)

Try this:

 public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            var res = Find(html);
        }

        public static List<LinkItem> Find(string file)
        {
            List<LinkItem> list = new List<LinkItem>();

            // 1.
            // Find all matches in file.
            MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
                RegexOptions.Singleline);

            // 2.
            // Loop over each match.
            foreach (Match m in m1)
            {
                string value = m.Groups[1].Value;
                LinkItem i = new LinkItem();

                // 3.
                // Get href attribute.
                Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                RegexOptions.Singleline);
                if (m2.Success)
                {
                    i.Href = m2.Groups[1].Value;
                }

                // 4.
                // Remove inner tags from text.
                string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                RegexOptions.Singleline);
                i.Text = t;

                list.Add(i);
            }
            return list;
        }

        public struct LinkItem
        {
            public string Href;
            public string Text;

            public override string ToString()
            {
                return Href + "\n\t" + Text;
            }
        }

    }  

Input:

  string html = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> 2.<a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> "; 

Result:

[0] = {www.aaa.xx/xx.zz?id=xxxx&name=xxxx}
[1] = {http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx}

C# Scraping HTML Links

Scraping HTML extracts important page elements. It has many legal uses for webmasters and ASP.NET developers. With the Regex type and WebClient, we implement screen scraping for HTML.

Edited

Another easy way: you can use a WebBrowser control to get the href from each <a> tag, like this (see my example):

 public Form1()
        {
            InitializeComponent();
            webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser1_DocumentCompleted);
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            webBrowser1.DocumentText = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"https://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"www.aaa.xx/xx.zz/xxx\" ....></a>";
        }

        void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            List<string> href = new List<string>();
            foreach (HtmlElement el in webBrowser1.Document.GetElementsByTagName("a"))
            {
                href.Add(el.GetAttribute("href"));
            }
        }

Try this regex:

"href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))"

You will get more help from these discussions:

Regular expression to extract URL from an HTML link

and

Regex to get the link in href. [asp.net]

Hope it's helpful.
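For reference, the `href\s*=\s*...` pattern from this answer could be used from C# roughly as follows. The names `LooseHrefFinder` and `FindHrefs` are invented for this sketch. Both alternatives write into the explicitly numbered group `(?<1>...)`, so the value is read from group 1 whether or not the attribute was quoted. One caveat: the unquoted branch uses `\S+`, which also swallows a `>` that directly follows an unquoted value.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LooseHrefFinder
{
    // The pattern from the answer, as a regular (non-verbatim) C# string literal.
    // Group 1 = href value, from either the quoted or the unquoted alternative.
    private static readonly Regex HrefRx = new Regex(
        "href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))");

    // Hypothetical helper: returns every href value the pattern finds.
    public static List<string> FindHrefs(string html)
    {
        var result = new List<string>();
        foreach (Match m in HrefRx.Matches(html))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```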

 HTMLDocument DOC = this.MySuperBrowser.Document as HTMLDocument;
 public IHTMLAnchorElement imageElementHref;
 imageElementHref = DOC.getElementById("idfirsticonhref") as IHTMLAnchorElement;

Simply try this code.

I came up with this one, which supports anchor and image tags and both single and double quotes.

<[a|img]+\\s+(?:[^>]*?\\s+)?[src|href]+=[\"']([^\"']*)['\"]

So

<a href="/something.ext">click here</a>

Will match:

 Match 1: /something.ext

And

<a href='/something.ext'>click here</a>

Will match:

 Match 1: /something.ext

The same goes for img src attributes.
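A side note on the pattern above: `[a|img]+` and `[src|href]+` are character classes, so they match any run of the listed characters (for example a tag named `<i` or an attribute named `ref`), not just the intended words. A possible variant, offered only as a sketch, swaps them for non-capturing alternation groups. The names `TagUrlFinder` and `FindUrls` are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagUrlFinder
{
    // Assumption: (?:a|img) and (?:src|href) replace the character classes
    // of the original pattern. Group 1 = attribute value.
    private static readonly Regex TagUrlRx = new Regex(
        @"<(?:a|img)\s+(?:[^>]*?\s+)?(?:src|href)=[""']([^""']*)[""']");

    // Hypothetical helper: returns href/src values of <a> and <img> tags.
    public static List<string> FindUrls(string html)
    {
        var result = new List<string>();
        foreach (Match m in TagUrlRx.Matches(html))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```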

I took a much simpler approach. This one simply looks for href attributes and captures the value that follows (between the quotes) into a group named url:

href=['"](?<url>.*?)['"]

I think in this case it is one of the simplest preg matches:

/<a\s*(.*?id[^"]*")/g

It gets links that have the variable id in the address.

The capture starts at href (inclusive), takes all characters (. matches anything except newline characters) up to and including the first occurrence of id, and then takes everything up to the nearest following " character ([^"]*).
