How do I read an HTML document in C#, given that I have the webpage source stored in a string variable?

I have tried to do this on my own but couldn't.

I have an HTML document, and I'm trying to extract the addresses of all the pictures in it into a C# collection, but I'm not sure of the syntax. I'm using HtmlAgilityPack. Here is what I have so far. Please advise.

The HTML code is the following:

<div style='padding-left:12px;' id='myWeb123'>
<b>MyWebSite Pics</b>
<br /><br />
<img src="http://myWebSite.com/pics/HHTR_01.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_02.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_03.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_04.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_05.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_06.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_07.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_08.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_09.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_10.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<a href="http://www.myWebSite.com/" target="_blank" rel="nofollow">Source</a>
</div>

And the C# code is the following:

HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();

document.Load("FileName.html");

// Targets a specific node
HtmlNode someNode = document.GetElementbyId("myWeb123");

//HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//a[@href]");

HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//div[@id='myWeb123']");

if (linkNodes != null)
{
    int count = 0;
    foreach(HtmlNode linkNode in linkNodes)
    {

        string linkTitle = linkNode.GetAttributeValue("src", string.Empty);

        Debug.Print("linkTitle = " + linkTitle);

        if (linkTitle == string.Empty)
        {
            HtmlNode imageNode = linkNode.SelectSingleNode("img[@alt]");
            if (imageNode != null)
            {
                Debug.Print("imageNode = " + imageNode.Attributes.ToString());
            }
        }
        count++;
        Debug.Print("count = " + count);
    }
}

I tried to use the HtmlAgilityPack documentation, but it lacks examples, and the information about its methods and classes is really hard for me to understand without them.

Try this. Sorry if it doesn't build as-is; I adapted our code to your situation.

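// Needs: using System.Collections.Generic; using System.Web; using HtmlAgilityPack;
List<string> result = new List<string>();
HtmlNodeCollection images = document.DocumentNode.SelectNodes("//img[@src]");

// SelectNodes returns null (not an empty collection) when nothing matches.
if (images != null)
{
    foreach (HtmlNode link in images)
    {
        HtmlAttribute att = link.Attributes["src"];

        // Decode repeatedly until the value stops changing,
        // so that doubly encoded URLs are fully unescaped.
        string temp = att.Value;
        string urlValue;
        do
        {
            urlValue = temp;
            temp = HttpUtility.UrlDecode(HttpUtility.HtmlDecode(urlValue));
        } while (temp != urlValue);

        result.Add(temp);
    }
}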
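To see what the do/while loop above does in isolation, here is a small sketch that decodes a value until it stops changing. It swaps in System.Net.WebUtility for HttpUtility so that it does not depend on System.Web; the class name DecodeDemo, the method name DecodeFully, and the sample input are made up for illustration.

```csharp
using System;
using System.Net;

public static class DecodeDemo
{
    // Repeatedly HTML-decodes and URL-decodes a value until it no longer changes.
    // Caveat: UrlDecode also turns '+' into a space, which may not be wanted for every URL.
    public static string DecodeFully(string value)
    {
        string temp = value;
        string previous;
        do
        {
            previous = temp;
            temp = WebUtility.UrlDecode(WebUtility.HtmlDecode(previous));
        } while (temp != previous);

        return temp;
    }

    public static void Main()
    {
        // "%2520" is a doubly encoded space: %25 decodes to '%', leaving "%20".
        Console.WriteLine(DecodeFully("a%2520b")); // prints "a b"
    }
}
```

For the image URLs in the question, which are not encoded at all, the loop exits after one pass and returns the value unchanged.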

You can use the overload of Load which takes a TextReader:

document.Load(new StringReader(text));

(I haven't looked over the rest of the code, but that addresses the "what do I do if I've already got the HTML in a string?" part.)
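As a minimal sketch of the string case: besides the TextReader overload, HtmlAgilityPack's HtmlDocument also has a LoadHtml method that takes the markup directly. The class name StringLoadingDemo, its two helper methods, and the sample markup are made up for illustration.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

public static class StringLoadingDemo
{
    // Parses markup that is already in memory via the TextReader overload of Load.
    public static HtmlDocument FromReader(string text)
    {
        var document = new HtmlDocument();
        using (var reader = new StringReader(text))
        {
            document.Load(reader);
        }
        return document;
    }

    // Parses the same markup with LoadHtml, which accepts the string directly.
    public static HtmlDocument FromString(string text)
    {
        var document = new HtmlDocument();
        document.LoadHtml(text);
        return document;
    }

    public static void Main()
    {
        // Hypothetical sample markup, shortened from the question's HTML.
        string text = "<div id='myWeb123'><img src='http://myWebSite.com/pics/HHTR_01.jpg' /></div>";

        // Both documents should contain the div with id 'myWeb123'.
        Console.WriteLine(FromReader(text).GetElementbyId("myWeb123") != null);
        Console.WriteLine(FromString(text).GetElementbyId("myWeb123") != null);
    }
}
```

Either call can replace document.Load("FileName.html") in the question when the source is already held in a string.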

In this line:

HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//div[@id='myWeb123']");

you are selecting the <div> node, not the <img> nodes under it. Try this to select those img nodes:

HtmlNodeCollection linkNodes = document.DocumentNode
     .SelectNodes("//div[@id='myWeb123']/img");

As for the selection syntax, it's identical to XPath as used in XML, so search for XPath if you want examples of the selection.

In this case:

  • the leading / starts the search from the root of the document (instead of from some "current node")
  • the // means that the next match can be at any depth instead of directly under the root
  • div[@id='myWeb123'] searches for a <div> node whose 'id' attribute has the value 'myWeb123'
  • the /img searches for an img node directly under the matched div node.
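Putting the XPath pieces above together, a sketch of collecting the image addresses into a List<string> might look like the following. The class name ImageSourceDemo and the method GetImageSources are hypothetical, the markup is a shortened copy of the question's HTML, and the extra [@src] predicate is an addition that skips img nodes without a src attribute.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ImageSourceDemo
{
    // Returns the src of every <img> directly under the <div> with the given id.
    // The id is concatenated into the XPath, so it must not contain a single quote.
    public static List<string> GetImageSources(string html, string divId)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var sources = new List<string>();
        HtmlNodeCollection imageNodes = document.DocumentNode
            .SelectNodes("//div[@id='" + divId + "']/img[@src]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (imageNodes == null)
        {
            return sources;
        }

        foreach (HtmlNode imageNode in imageNodes)
        {
            sources.Add(imageNode.GetAttributeValue("src", string.Empty));
        }
        return sources;
    }

    public static void Main()
    {
        string html =
            "<div style='padding-left:12px;' id='myWeb123'>" +
            "<img src=\"http://myWebSite.com/pics/HHTR_01.jpg\" alt='myWebSitePics' /><br />" +
            "<img src=\"http://myWebSite.com/pics/HHTR_02.jpg\" alt='myWebSitePics' /><br />" +
            "<a href=\"http://www.myWebSite.com/\">Source</a>" +
            "</div>";

        foreach (string src in GetImageSources(html, "myWeb123"))
        {
            Console.WriteLine(src);
        }
    }
}
```

Returning an empty list when nothing matches keeps callers from having to handle the null that SelectNodes produces.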

Using XPath like this will be expensive if the page size grows. A better approach is to deserialize the HTML into an object. You also don't need the HtmlAgilityPack reference that you are using. Load the HTML with a StreamReader and then use XmlSerializer; note that XmlSerializer requires well-formed XML, so this works only when the page is valid XHTML. Use the XSD tool first to convert the file to an XSD, and then to generate a class from that XSD:

1)
C:\Program Files\Microsoft Visual Studio 9.0\VC>xsd /c /language:CS c:\xtest.xml

Microsoft (R) Xml Schemas/DataTypes support utility
[Microsoft (R) .NET Framework, Version 2.0.50727.3038]
Copyright (C) Microsoft Corporation. All rights reserved.
Writing file 'C:\Program Files\Microsoft Visual Studio 9.0\VC\xtest.xsd'.

2)
C:\Program Files\Microsoft Visual Studio 9.0\VC>xsd /c  xtest.xsd
Microsoft (R) Xml Schemas/DataTypes support utility
[Microsoft (R) .NET Framework, Version 2.0.50727.3038]
Copyright (C) Microsoft Corporation. All rights reserved.
Writing file 'C:\Program Files\Microsoft Visual Studio 9.0\VC\xtest.cs'.

Import this class into your solution:

// Needs: using System.IO; using System.Xml.Serialization;
html col;
using (StreamReader reader = new StreamReader("c:\\test.html"))
{
    XmlSerializer ser = new XmlSerializer(typeof(html));
    col = (html)ser.Deserialize(reader);
}

The col object will then contain all the src values of the img tags in one shot.

