Extracting image URLs from HTML in C# using Html Agility Pack and writing them to an XML file

I am new to C# and I really need help with the following problem. I want to extract the URLs of photos on a web page whose file names follow a specific pattern. For example, I want to extract all the images named like name_412s.jpg. I use the following code to extract images from HTML, but I do not know how to adapt it.

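public void Images()
{
    WebClient x = new WebClient();
    string source = x.DownloadString(@"http://www.google.com");

    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    // LoadHtml parses an HTML string; Load expects a file path or stream.
    document.LoadHtml(source);

    // Collect the src attribute of every <img> element.
    List<string> images = new List<string>();
    foreach (HtmlNode link in document.DocumentNode.SelectNodes("//img"))
    {
        images.Add(link.GetAttributeValue("src", ""));
    }
}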

I also need to write the results to an XML file. Can you help me with that as well?

Thank you!

To limit the query results, you need to add a condition to your XPath. For instance, //img[contains(@src, 'name_412s.jpg')] will limit the results to img elements whose src attribute contains that file name. Matching on the shorter fragment '_412s.jpg' instead catches every file name that carries that suffix.
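As a minimal sketch of that filter in isolation, the helper below collects only the src values of the matching img elements. The class name ImageUrlFilter, the method name GetImageUrls, and the fragment parameter are hypothetical names chosen for illustration, and the code assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ImageUrlFilter
{
    // Returns the src value of every <img> whose src contains the given fragment.
    // Assumption: fragment contains no single quote, since it is embedded
    // directly in the XPath string literal.
    public static List<string> GetImageUrls(string html, string fragment)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html); // parse from a string, not from a file path

        var urls = new List<string>();
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(
            "//img[contains(@src, '" + fragment + "')]");

        // SelectNodes returns null rather than an empty collection when nothing matches.
        if (nodes == null)
        {
            return urls;
        }

        foreach (HtmlNode img in nodes)
        {
            urls.Add(img.GetAttributeValue("src", string.Empty));
        }
        return urls;
    }
}
```

Calling ImageUrlFilter.GetImageUrls(source, "_412s.jpg") on the downloaded page source would then return just the URLs that carry that suffix.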

As for writing the results out to XML, you'll need to create a new XML document and then copy the matching elements into it. Since you can't import an HtmlAgilityPack node directly into an XmlDocument, you'll have to copy all the attributes manually. For instance:

using System.Net;
using System.Xml;
using HtmlAgilityPack;

// ...

public void Images()
{
    WebClient x = new WebClient();
    string source = x.DownloadString(@"http://www.google.com");
    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    // LoadHtml parses an HTML string; Load expects a file path or stream.
    document.LoadHtml(source);
    XmlDocument output = new XmlDocument();
    XmlElement imgElements = output.CreateElement("ImgElements");
    output.AppendChild(imgElements);
    // SelectNodes returns null when nothing matches, so guard before iterating.
    HtmlNodeCollection links = document.DocumentNode.SelectNodes("//img[contains(@src, '_412s.jpg')]");
    if (links != null)
    {
        foreach (HtmlNode link in links)
        {
            // Recreate the element in the XML document and copy its attributes.
            XmlElement img = output.CreateElement(link.Name);
            foreach (HtmlAttribute a in link.Attributes)
            {
                img.SetAttribute(a.Name, a.Value);
            }
            imgElements.AppendChild(img);
        }
    }
    output.Save(@"C:\test.xml");
}
