
C# grab URLs using Html Agility Pack

Okay, so there is a list of URLs on this webpage, and I am wondering how to grab those URLs and add them to an ArrayList.

http://www.animenewsnetwork.com/encyclopedia/anime.php?list=A

I only want the URLs that are in the list; look at the page to see what I mean. I tried doing it myself, and for whatever reason it picks up all of the other URLs except the ones I need.

   http://pastebin.com/a7hJnXPP

Using Html Agility Pack:

// Requires the HtmlAgilityPack NuGet package.
using System.Linq;
using System.Net;

using (var wc = new WebClient())
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(wc.DownloadString("http://www.animenewsnetwork.com/encyclopedia/anime.php?list=A"));
    // Take every <a> inside the list container (div class="lst") and read its href.
    var links = doc.DocumentNode.SelectSingleNode("//div[@class='lst']")
        .Descendants("a")
        .Select(x => x.Attributes["href"].Value)
        .ToArray();
}

If you want only the ones in the list, then the following code should work (this assumes you already have the page loaded into an HtmlDocument, here called animePage):

List<string> hrefList = new List<string>(); //Make a list cause lists are cool.

foreach (HtmlNode node in animePage.DocumentNode.SelectNodes("//a[contains(@href, 'id=')]"))
{
    //Append animenewsnetwork.com to the beginning of the href value and add it
    // to the list.
    hrefList.Add("http://www.animenewsnetwork.com" + node.GetAttributeValue("href", "null"));
}

Breaking the XPath //a[contains(@href, 'id=')] down:

  • //a Select all <a> nodes...
  • [contains(@href, 'id=')] ... that have an href attribute containing the text id= .
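
To see the XPath in isolation, here is a minimal sketch that runs it against an HTML string instead of the live page. The class name AnimeLinkExtractor, the method name Extract, and the sample markup in the usage note are made up for illustration; only the XPath and the site address come from the answer above. It also guards against SelectNodes returning null when nothing matches, and it resolves relative hrefs with System.Uri rather than by string concatenation, which is a swapped-in choice, not what the original snippet does.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper, not part of Html Agility Pack itself.
public static class AnimeLinkExtractor
{
    // Returns absolute URLs for every <a> whose href contains "id=".
    public static List<string> Extract(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//a[contains(@href, 'id=')]");
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            string href = node.GetAttributeValue("href", string.Empty);
            // Resolve a relative href such as "/encyclopedia/anime.php?id=1"
            // against the site root.
            result.Add(new Uri(baseUri, href).AbsoluteUri);
        }

        return result;
    }
}
```

For example, calling AnimeLinkExtractor.Extract with the markup <a href="/encyclopedia/anime.php?id=1">A</a><a href="/about">About</a> and new Uri("http://www.animenewsnetwork.com") as the base keeps only the first link. Note that contains() is a plain substring test, so an href containing something like pid= would match as well.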

That should be enough to get you going.

As an aside, I would suggest not showing each link in its own message box, considering there are around 500 links on that page. 500 links = 500 message boxes :(
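
One way to avoid that, sketched under the assumption that the links are already collected in a list of strings: join them into a single block of text and show it once. LinkReport and Build are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for displaying many links in one place.
public static class LinkReport
{
    // Builds one multi-line string: a count header followed by one link per line.
    public static string Build(IEnumerable<string> links)
    {
        var list = new List<string>(links);
        return list.Count + " links found:" + Environment.NewLine
            + string.Join(Environment.NewLine, list);
    }
}
```

The result can then go into a single multi-line text box, a list control, or the console, for instance Console.WriteLine(LinkReport.Build(hrefList)).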
