How do I read an HTML document in C#, assuming I have already stored the page source in a string variable?

Question

I have tried to do this on my own but couldn't.

I have an HTML document, and I'm trying to extract the addresses of all the pictures in it into a C# collection, but I'm not sure of the syntax. I'm using HtmlAgilityPack. Here is what I have so far. Please advise.

The HTML is as follows:

<div style='padding-left:12px;' id='myWeb123'>
<b>MyWebSite Pics</b>
<br /><br />
<img src="http://myWebSite.com/pics/HHTR_01.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_02.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_03.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_04.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_05.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_06.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_07.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_08.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_09.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_10.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<a href="http://www.myWebSite.com/" target="_blank" rel="nofollow">Source</a>
</div>

And the C# code is as follows:

HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();

document.Load("FileName.html");

// Targets a specific node
HtmlNode someNode = document.GetElementbyId("myWeb123");

//HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//a[@href]");

HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//div[@id='myWeb123']");

if (linkNodes != null)
{
    int count = 0;
    foreach(HtmlNode linkNode in linkNodes)
    {

        string linkTitle = linkNode.GetAttributeValue("src", string.Empty);

        Debug.Print("linkTitle = " + linkTitle);

        if (linkTitle == string.Empty)
        {
            HtmlNode imageNode = linkNode.SelectSingleNode("img[@alt]");
            if (imageNode != null)
            {
                Debug.Print("imageNode = " + imageNode.Attributes.ToString());
            }
        }
        count++;
        Debug.Print("count = " + count);
    }
}

I tried to use the HtmlAgilityPack documentation, but it lacks examples, and the information about its methods and classes is really hard for me to understand without them.

Answer 1

Try this. Sorry if it doesn't build; I adapted our code to your situation.

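// Requires: using System.Collections.Generic; using System.Web; using HtmlAgilityPack;
List<string> result = new List<string>();

// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection images = document.DocumentNode.SelectNodes("//img[@src]");
if (images != null)
{
    foreach (HtmlNode link in images)
    {
        HtmlAttribute att = link.Attributes["src"];

        // Decode repeatedly until the value stops changing,
        // in case the URL was encoded more than once.
        string temp = att.Value;
        string urlValue;
        do
        {
            urlValue = temp;
            temp = HttpUtility.UrlDecode(HttpUtility.HtmlDecode(urlValue));
        } while (temp != urlValue);

        result.Add(temp);
    }
}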

Answer 2

You can use the overload of Load that takes a TextReader:

document.Load(new StringReader(text));

(I haven't looked over the rest of the code, but that addresses the "what do I do if I've already got the HTML in a string?" part.)
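To tie the string-loading step to the original goal, here is a minimal sketch that parses HTML held in a string and collects every image address into a list. It uses LoadHtml, which takes the string directly, as an alternative to wrapping it in a StringReader. The class name ImageUrlExtractor and the method name Extract are made up for illustration, and the sketch assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class ImageUrlExtractor
{
    // Hypothetical helper: returns the src value of every <img> in the given HTML string.
    public static List<string> Extract(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html); // parses straight from a string; no file or reader needed

        var urls = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection images = document.DocumentNode.SelectNodes("//img[@src]");
        if (images == null)
        {
            return urls;
        }

        foreach (HtmlNode image in images)
        {
            urls.Add(image.GetAttributeValue("src", string.Empty));
        }
        return urls;
    }
}
```

Calling ImageUrlExtractor.Extract(text) on the HTML from the question should give the ten HHTR_xx.jpg addresses in document order.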

Answer 3

In this line:

HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//div[@id='myWeb123']");

you are selecting the <div> node, not the <img> nodes under it. Try this to select those img nodes:

HtmlNodeCollection linkNodes = document.DocumentNode
     .SelectNodes("//div[@id='myWeb123']/img");

As for the selection syntax, it's identical to XPath as used in XML, so search for XPath if you want examples of selections.

In this case:

the leading / starts searching from the root of the document (instead of from some "current node")
the // means that the next match can be at any depth instead of directly under the root
div[@id='myWeb123'] searches for a <div> node with an attribute 'id' that has the value 'myWeb123'
the /img searches for an img node directly under the matched div node.
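The difference between / and // is easiest to see by running several queries against the same markup. The sketch below is only an illustration: the class XPathDemo, the method SrcValues, and the short test markup are invented for this example, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class XPathDemo
{
    // Hypothetical helper: runs an XPath query against an HTML string and
    // returns the src attribute of each matched node.
    public static List<string> SrcValues(string html, string xpath)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var result = new List<string>();
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(xpath);
        if (nodes != null) // null means the query matched nothing
        {
            foreach (HtmlNode node in nodes)
            {
                result.Add(node.GetAttributeValue("src", string.Empty));
            }
        }
        return result;
    }
}
```

Take the markup <div id='myWeb123'><img src='a.jpg' /><span><img src='b.jpg' /></span></div><img src='c.jpg' /> as input. The query //div[@id='myWeb123']/img should match only a.jpg (a direct child of the div), //div[@id='myWeb123']//img should match a.jpg and b.jpg (any depth under the div), and //img should match all three.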

Answer 4

Using XPath like this will be expensive if the page size grows. It is better to deserialize the HTML to an object. You also don't need the HtmlAgilityPack reference you are using. Load the HTML with a StreamReader and use XmlSerializer. Use the XSD tool: first convert the file to an XSD, then generate a class from the XSD.

1)
C:\Program Files\Microsoft Visual Studio 9.0\VC>xsd /c /language:CS c:\xtest.xml

Microsoft (R) Xml Schemas/DataTypes support utility
[Microsoft (R) .NET Framework, Version 2.0.50727.3038]
Copyright (C) Microsoft Corporation. All rights reserved.
Writing file 'C:\Program Files\Microsoft Visual Studio 9.0\VC\xtest.xsd'.

2)
C:\Program Files\Microsoft Visual Studio 9.0\VC>xsd /c  xtest.xsd
Microsoft (R) Xml Schemas/DataTypes support utility
[Microsoft (R) .NET Framework, Version 2.0.50727.3038]
Copyright (C) Microsoft Corporation. All rights reserved.
Writing file 'C:\Program Files\Microsoft Visual Studio 9.0\VC\xtest.cs'.

Import this class into your solution:

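// Requires: using System.IO; using System.Xml.Serialization;
// 'html' is the class generated by the XSD tool in step 2.
XmlSerializer ser = new XmlSerializer(typeof(html));
html col;
using (StreamReader reader = new StreamReader("c:\\test.html"))
{
    col = (html)ser.Deserialize(reader);
}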

The col object will then contain the src values of all the img tags in one shot.

4 answers

Answer 1
3 accepted 2011-11-25 08:57:27

Answer 2
2 2011-11-25 08:53:48

Answer 3
0 2011-11-25 09:14:35

Answer 4
0 2011-11-25 10:03:53
