Get href from html using mshtml in C#

I am trying to get the href link out of the following HTML code using mshtml in C# (WPF).

<a class="button_link" href="https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1&amp;sig=b0dbd522380a21007d8c375iuc583f46a90365d9&amp;iid=am-130280753913638201274485430&amp;ac=1&amp;uid=1284488216&amp;nid=18+308" style="border:none;color:#0084b4;text-decoration:none;color:#ffffff;font-size:13px;font-weight:bold;font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;">Confirm your account now</a>

I have tried the following code, but I have failed miserably.

HTMLDocument mdoc = (HTMLDocument)browser.Document;
string innerHtml = mdoc.body.outerText;
string str = "https://rhystowey.com/account/confirm_email/";
int index = innerHtml.IndexOf(str);
innerHtml = innerHtml.Remove(0, index + str.Length);
int startIndex = innerHtml.IndexOf("\"");
string str3 = innerHtml.Remove(startIndex, innerHtml.Length - startIndex);
string thelink = "https://rhystowey.com/account/confirm_email/" + str3;

Can someone please help me get this to work?

Use this:

var ex = new Regex("href=\"(.*)\" style");
var tag = "<a class=\"button_link\" href=\"https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1&amp;sig=b0dbd522380a21007d8c375iuc583f46a90365d9&amp;iid=am-130280753913638201274485430&amp;ac=1&amp;uid=1284488216&amp;nid=18+308\" style=\"border:none;color:#0084b4;text-decoration:none;color:#ffffff;font-size:13px;font-weight:bold;font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;\">Confirm your account now</a>";

var address = ex.Match(tag).Groups[1].ToString();

But you should extend it with a check, because the match can fail; in that case Groups[1].Value is an empty string rather than a link.
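A minimal sketch of such a check, written as C# top-level statements. The shortened tag and the variable names are made up for illustration, and the pattern uses [^"]+ instead of (.*) so that it does not rely on a style attribute following the href.

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened, hypothetical version of the tag from the question.
var tag = "<a class=\"button_link\" href=\"https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1&amp;ac=1\" style=\"border:none;\">Confirm your account now</a>";

// [^\"]+ stops at the closing quote of the href value.
var ex = new Regex("href=\"([^\"]+)\"");
var match = ex.Match(tag);

// Only read the group when the match succeeded; otherwise fall back to null.
var address = match.Success ? match.Groups[1].Value : null;

// Input without any href: Success is false and Groups[1].Value is empty.
var noMatch = ex.Match("<p>no link here</p>");

Console.WriteLine(address ?? "(no href found)");
Console.WriteLine(noMatch.Success);
```

Checking match.Success first makes the "no link" case explicit instead of silently carrying an empty string around.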

In your example

HTMLDocument mdoc = (HTMLDocument)browser.Document;
// outerText would return only the visible text, without the href attribute
string innerHtml = mdoc.body.innerHTML;
var ex = new Regex("href=\"([^\"]+)\"");
var address = ex.Match(innerHtml).Groups[1].ToString();

will match the first href="...". Or you can select all occurrences:

var matches = (from Match match in ex.Matches(innerHtml) select match.Groups[1].Value).ToList();

This will give you a List<string> with all the links in your HTML. To filter them, you can either do it this way:

var wantedMatches = matches.Where(m => m.StartsWith("https://rhystowey.com/account/confirm_email/"));

which is more flexible, because you could check against a list of start strings, for example. Or you can do it in the regex itself, which should perform better:

var ex = new Regex("href=\"(https://rhystowey\\.com/account/confirm_email/[^\"]+)\"");
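The "list of start strings" variant mentioned above could look roughly like this. It is only a sketch: the sample links, the second prefix and the name wantedPrefixes are invented for the example, and the case-insensitive comparison is an assumption about what you want.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the list of extracted links; the values are made up.
var matches = new List<string>
{
    "https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1",
    "https://rhystowey.com/help",
    "https://example.com/other"
};

// Hypothetical list of accepted start strings.
var wantedPrefixes = new[]
{
    "https://rhystowey.com/account/confirm_email/",
    "https://rhystowey.com/account/reset_password/"
};

// Keep a link if it starts with any of the accepted prefixes.
var wantedMatches = matches
    .Where(m => wantedPrefixes.Any(p => m.StartsWith(p, StringComparison.OrdinalIgnoreCase)))
    .ToList();

Console.WriteLine(string.Join(Environment.NewLine, wantedMatches));
```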

Bringing it all together, as far as I understand what you want:

var ex = new Regex("href=\"(https://rhystowey\\.com/account/confirm_email/[^\"]+)\"");
// innerHtml is the string read from mdoc.body above
var matches = (from Match match in ex.Matches(innerHtml)
               where match.Groups[1].Success
               select match.Groups[1].Value).ToList();
var firstAddress = matches.FirstOrDefault();

firstAddress holds your link if there is one; otherwise it is null.
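One thing to keep in mind: a link pulled out of raw markup still contains HTML entities such as &amp;. Assuming you want to navigate to the address, it probably needs decoding first. A small sketch using System.Net.WebUtility.HtmlDecode; the value below is a shortened sample of the link from the question.

```csharp
using System;
using System.Net;

// Sample value as it would come out of the regex (shortened from the question's link).
var firstAddress = "https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1&amp;ac=1&amp;nid=18+308";

// HtmlDecode turns &amp; back into &, leaving the rest of the URL unchanged.
var decoded = WebUtility.HtmlDecode(firstAddress);

Console.WriteLine(decoded);
```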

If your link will always start with the same path and isn't repeated on the page, you can use this (untested):

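    // html is the page markup, e.g. mdoc.body.innerHTML
    var match = Regex.Match(html, @"href=""(?<href>https://rhystowey\.com/account/confirm_email/[^""]+)""");

    if (match.Success)
    {
        // A declaration needs its own block; "if (...) var x = ...;" does not compile
        var href = match.Groups["href"].Value;
    }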
