Getting an href from HTML using mshtml in C#

Question

I am trying to get the href link from the following HTML using mshtml in C# (WPF).

<a class="button_link" href="https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1&amp;sig=b0dbd522380a21007d8c375iuc583f46a90365d9&amp;iid=am-130280753913638201274485430&amp;ac=1&amp;uid=1284488216&amp;nid=18+308" style="border:none;color:#0084b4;text-decoration:none;color:#ffffff;font-size:13px;font-weight:bold;font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;">Confirm your account now</a>

I tried to get this working with mshtml in C# (WPF) using the code below, but it fails.

HTMLDocument mdoc = (HTMLDocument)browser.Document;
string innerHtml = mdoc.body.outerText;
string str = "https://rhystowey.com/account/confirm_email/";
int index = innerHtml.IndexOf(str);
innerHtml = innerHtml.Remove(0, index + str.Length);
int startIndex = innerHtml.IndexOf("\"");
string str3 = innerHtml.Remove(startIndex, innerHtml.Length - startIndex);
string thelink = "https://rhystowey.com/account/confirm_email/" + str3;

Can someone help me solve this?

Answer 1

Use this:

var ex = new Regex("href=\"(.*)\" style");
var tag = "<a class=\"button_link\" href=\"https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1&amp;sig=b0dbd522380a21007d8c375iuc583f46a90365d9&amp;iid=am-130280753913638201274485430&amp;ac=1&amp;uid=1284488216&amp;nid=18+308\" style=\"border:none;color:#0084b4;text-decoration:none;color:#ffffff;font-size:13px;font-weight:bold;font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;\">Confirm your account now</a>";

var address = ex.Match(tag).Groups[1].ToString();

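However, you should extend this with a check (for example on Match.Success), because when nothing matches, Groups[1] does not contain a link: its Success property is false and its value is an empty string.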
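A minimal sketch of such a check, assuming you only need the first href in a tag. The names HrefCheck and TryGetHref are made up for this illustration and are not part of mshtml or the .NET library.

```csharp
using System.Text.RegularExpressions;

static class HrefCheck
{
    // Hypothetical helper: returns true and sets href when the input contains
    // href="..."; returns false and sets href to null when nothing matches.
    public static bool TryGetHref(string tag, out string href)
    {
        // [^"]+ stops at the first closing quote, so the pattern does not
        // depend on a style attribute following the href.
        Match m = Regex.Match(tag, "href=\"([^\"]+)\"");
        href = m.Success ? m.Groups[1].Value : null;
        return m.Success;
    }
}
```

You would call it as HrefCheck.TryGetHref(tag, out address) and only use address when the call returns true.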

In your example:

HTMLDocument mdoc = (HTMLDocument)browser.Document;
string innerHtml = mdoc.body.innerHTML; // outerText returns only the visible text, so href attributes would be missing
var ex = new Regex("href=\"([^\"]+)\"");
var address = ex.Match(innerHtml).Groups[1].ToString();

This matches the first href="...". Alternatively, you can select all occurrences:

var matches = (from Match match in ex.Matches(innerHtml) select match.Groups[1].Value).ToList();

This gives you a List<string> containing all the links in the HTML. To filter it, you can do this:

var wantedMatches = matches.Where(m => m.StartsWith("https://rhystowey.com/account/confirm_email/"));

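This is more flexible, because you could check against a list of starting strings or something else. Or you can do the filtering in the regular expression itself, which should perform better: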

var ex = new Regex("href=\"(https://rhystowey\\.com/account/confirm_email/[^\"]+)\"");

Putting it all together, as far as I can tell:

var ex = new Regex("href=\"(https://rhystowey\\.com/account/confirm_email/[^\"]+)\"");
var matches = (from Match match in ex.Matches(innerHtml)
               where match.Groups.Count > 1
               select match.Groups[1].Value).ToList();
var firstAddress = matches.FirstOrDefault();

If a link exists, firstAddress holds it; otherwise firstAddress is null.
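One detail worth adding: in the HTML source the query string is written with &amp;, so the captured value still contains those entities and needs decoding before you navigate to it. Below is a rough sketch of the combined approach with that step included, assuming the input is the page's HTML markup as a string. ConfirmLinks and Extract are names invented for this example.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

static class ConfirmLinks
{
    // Same pattern as above: only hrefs that start with the confirm_email path.
    static readonly Regex Href = new Regex(
        "href=\"(https://rhystowey\\.com/account/confirm_email/[^\"]+)\"");

    // Hypothetical helper: returns every matching link with HTML entities decoded.
    public static List<string> Extract(string html)
    {
        var result = new List<string>();
        foreach (Match m in Href.Matches(html))
        {
            // WebUtility.HtmlDecode turns &amp; back into & so the URL is usable.
            result.Add(WebUtility.HtmlDecode(m.Groups[1].Value));
        }
        return result;
    }
}
```

With the question's code this would be called as ConfirmLinks.Extract(mdoc.body.innerHTML), taking the first element if the list is not empty.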

Answer 2

If your link always starts with the same path and is not repeated on the page, you can use this (untested):

    var match = Regex.Match(html, @"href=""(?<href>https\:\/\/rhystowey\.com\/account\/confirm_email\/[^""]+)""");

    if (match.Success)
    {
      var href = match.Groups["href"].Value;
      ....
    }

Answer 1: score 1, accepted, posted 2013-03-22 00:50:30
Answer 2: score 0, posted 2013-03-22 00:57:40
