Extracting href values from an HTML string

Question

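I'm trying to build a crawler that returns only the links from a website, and so far I have it returning the page's HTML. I now want to use an if statement to check whether the string was returned and, if so, search it for all <a> tags and show me their href links. But I don't know which object to check or what value to check for.

Here is what I have so far: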

using System;

namespace crawler
{
    class Program
    {
        static void Main(string[] args)
        {
            System.Net.WebClient wc = new System.Net.WebClient();
            string WebData = wc.DownloadString("https://www.abc.net.au/news/science/");
            Console.WriteLine(WebData);
            // if 
        }
    }
}

Answer 1

You could take a look at HTML Agility Pack.

With it, you can find all the links on a web page, for example:

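 // Requires the HtmlAgilityPack package (using HtmlAgilityPack;).
 var hrefs = new List<string>();
 var hw = new HtmlWeb();
 HtmlDocument document = hw.Load(/* your url here */);

 // SelectNodes returns null when no node matches, so guard before iterating.
 var links = document.DocumentNode.SelectNodes("//a[@href]");
 if (links != null)
 {
     foreach (HtmlNode link in links)
     {
         HtmlAttribute attribute = link.Attributes["href"];

         if (!string.IsNullOrWhiteSpace(attribute.Value))
             hrefs.Add(attribute.Value);
     }
 }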
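
Since the question already has the page's HTML in a string, the same idea can be applied without downloading the page a second time. The following is a minimal sketch, assuming the HtmlAgilityPack package is installed; the `LinkExtractor` class and `ExtractHrefs` method are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class LinkExtractor
{
    // Hypothetical helper: returns every non-empty href found in the given HTML.
    public static List<string> ExtractHrefs(string html)
    {
        var hrefs = new List<string>();
        if (string.IsNullOrEmpty(html))
            return hrefs; // nothing was downloaded, so there is nothing to parse

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var links = document.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
            return hrefs;

        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", string.Empty);
            if (!string.IsNullOrWhiteSpace(href))
                hrefs.Add(href);
        }
        return hrefs;
    }
}
```

In the question's `Main`, this could be called as `LinkExtractor.ExtractHrefs(WebData)` and the result printed with a `foreach` loop. The empty-string check at the top plays the role of the "if" the question asks about.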

Answer 2

First, you can write a function that returns the HTML of the whole page, as you have already done. Here is the one I have!

// Requires: using System.IO; using System.Net;
public string GetPageContents()
{
    string link = "https://www.abc.net.au/news/science/";
    string pageContent = "";

    // The using statements dispose the client, stream, and reader when done.
    using (WebClient web = new WebClient())
    using (Stream stream = web.OpenRead(link))
    using (StreamReader reader = new StreamReader(stream))
    {
        pageContent = reader.ReadToEnd();
    }

    return pageContent;
}

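Then you can write a function that returns a substring, or a list of substrings (if you want all the <a> tags, you will probably get more than one).

List<string> divTags = GetBetweenTags(pageContents, "<div>", "</div>");

This gives you a list of, for example, <div> tags, and you can then search inside each one for <a> tags.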

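// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public List<string> GetBetweenTags(string pageContents, string startTag, string endTag)
{
    // Escape the tags so they match literally; Singleline lets "." span line breaks.
    Regex rx = new Regex(Regex.Escape(startTag) + "(.*?)" + Regex.Escape(endTag), RegexOptions.Singleline);
    MatchCollection col = rx.Matches(pageContents);

    List<string> tags = new List<string>();

    // Each match includes the start and end tags themselves.
    foreach (Match s in col)
        tags.Add(s.ToString());

    return tags;
}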
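
Because a literal start tag such as "<a>" will not match an anchor that carries attributes, pulling out the href values this way needs a pattern that looks inside the opening tag. The following is a rough sketch of that, assuming quoted attribute values and using only System.Text.RegularExpressions; the `RegexLinkExtractor` class, its `ExtractHrefs` method, and the pattern itself are illustrative, not part of the answer above. A regular expression like this will miss unquoted values and can be fooled by unusual markup, so a real parser is the safer choice.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexLinkExtractor
{
    // Matches the opening <a ...> tag and captures a single- or double-quoted href value.
    // The \b after "a" keeps tags such as <abbr> from matching.
    private static readonly Regex AnchorHref = new Regex(
        "<a\\b[^>]*?\\shref\\s*=\\s*(?:\"(?<url>[^\"]*)\"|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every non-empty href captured from anchor tags.
    public static List<string> ExtractHrefs(string html)
    {
        var hrefs = new List<string>();
        foreach (Match m in AnchorHref.Matches(html))
        {
            string url = m.Groups["url"].Value;
            if (!string.IsNullOrWhiteSpace(url))
                hrefs.Add(url);
        }
        return hrefs;
    }
}
```

Calling `RegexLinkExtractor.ExtractHrefs(pageContents)` on the downloaded page would return the captured values in document order.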

Edit: Wow, I didn't know about HTML Agility Pack. Thanks @Gauravsa, I'll update my project to use it!

2 solutions

Solution 1: 2, accepted, 2019-01-16 00:38:30

Solution 2: 1, 2019-01-16 00:32:46
