How to extract anchor tag links using a regular expression (regex, C#)

Question

So far I have:

<a href="(http://www.imdb.com/title/tt\d{7}/)".*?>.*?</a>

C#

ArrayList imdbUrls = matchAll(@"<a href=""(http://www.imdb.com/title/tt\d{7}/)"".*?>.*?</a>", html);
private ArrayList matchAll(string regex, string html, int i = 0)
{
  ArrayList list = new ArrayList();
  foreach (Match m in new Regex(regex, RegexOptions.Multiline).Matches(html))
    list.Add(m.Groups[i].Value.Trim());
  return list;
}

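I'm trying to extract IMDb links from an HTML page. What is wrong with this regular expression?

The main idea is to search Google for a movie and then look through the results for a link to IMDb.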

Answer 1

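Regular expressions are not a good choice for parsing HTML. HTML as found on real pages is neither strict nor regular in its formatting.

Use HtmlAgilityPack instead. You can retrieve the links with it using this code: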

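// Needs: using System.Linq; using System.Text.RegularExpressions; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// This XPath selects every anchor tag that has an href attribute.
// Note: SelectNodes returns null when nothing matches, so check for that on pages without links.
List<string> anchorImdbList = doc.DocumentNode.SelectNodes("//a[@href]")
                  .Select(p => p.Attributes["href"].Value)
                  .Where(x => Regex.IsMatch(x, @"www\.imdb\.com"))
                  .ToList();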
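The filter above keeps every href that mentions www.imdb.com, including name and news pages. If only title pages are wanted, the check could be narrowed. The following is a minimal sketch that works on href strings that have already been extracted; the class name `ImdbHrefFilter`, the method name `KeepTitleLinks`, and the assumed URL shape (`www.imdb.com/title/tt` followed by digits) are illustrative assumptions, not part of HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ImdbHrefFilter
{
    // Assumption: a title page URL contains "www.imdb.com/title/tt" followed by digits.
    // The pattern is not anchored, so redirect-style hrefs that embed the URL also pass.
    private static readonly Regex TitleUrl =
        new Regex(@"www\.imdb\.com/title/tt\d+", RegexOptions.IgnoreCase);

    // Returns only the hrefs that look like IMDb title pages; null entries are skipped.
    public static List<string> KeepTitleLinks(IEnumerable<string> hrefs)
    {
        return hrefs.Where(h => h != null && TitleUrl.IsMatch(h)).ToList();
    }
}
```

It could be applied to the result of the `Select` step above, for example `ImdbHrefFilter.KeepTitleLinks(hrefValues)`, where `hrefValues` is a hypothetical name for that sequence.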

Answer 2

Try this:

string tag = "tag of the link";
string emptystring = Regex.Replace(tag, "<.*?>", string.Empty);

Update:

string emptystring = Regex.Replace(tag, @"<[^>]*>", string.Empty);
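To make the effect of this replacement concrete, here is a small sketch that wraps the same pattern in a helper. The names `TagStripDemo` and `StripTags` are hypothetical. Note that the call removes the tags themselves, so what remains is the link text rather than the href value.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripDemo
{
    // Removes every substring that starts with '<' and ends at the next '>',
    // leaving only the text that sat between the tags.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, @"<[^>]*>", string.Empty);
    }
}
```

For the input `<a href="http://www.imdb.com/title/tt0133093/">The Matrix</a>`, `StripTags` returns `The Matrix`.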

Answer 3

You have to escape the forward slashes. Try:

<a href="(http:\/\/www.imdb.com\/title\/tt\d{7}\/)".*?>.*?<\/a>

If you need to parse HTML elements out of complex pages, regular expressions become very cumbersome. Try Html Agility Pack, as others have suggested.
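If a regex is kept for a simple, predictable page, one detail worth checking is which group is read: the URL sits in capture group 1, while group 0 is the whole matched tag. The sketch below uses the escaped pattern from this answer, with the dots additionally escaped and `RegexOptions.Singleline` added so that `.` can span line breaks; both of those changes, and the names `ImdbLinkRegex` and `ExtractTitleUrls`, are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImdbLinkRegex
{
    // Group 1 captures the href value; the rest of the pattern matches the remainder of the tag.
    private static readonly Regex Anchor = new Regex(
        @"<a href=""(http:\/\/www\.imdb\.com\/title\/tt\d{7}\/)"".*?>.*?<\/a>",
        RegexOptions.Singleline);

    // Collects the captured URL (group 1) from every matching anchor tag.
    public static List<string> ExtractTitleUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Anchor.Matches(html))
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

Like the original expression, this only matches anchors whose first attribute is `href` and whose URL has exactly this form, which is part of why a parser is the more robust choice.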

Answer 1
1, accepted, 2012-11-13 13:03:46

Answer 2
0, 2012-11-13 12:50:22

Answer 3
0, 2012-11-13 12:57:12
