HTML Parsing C# HTMLAgilityPack

I am having a problem reading some values from an HTML string using the HTMLAgilityPack.

The two items I want to read are Newspaper : 82548828 and Fish : 8545852485

But with the code I have written so far, I only ever get back the Newspaper item.

I suspect the XPath I am using is not fully correct. I think the XPath for the first loop is correct, as this gives me back the two

I want my second loop to loop over these two items (it thinks there are 6???)

Also, is div2.SelectSingleNode(sXPathT); the correct way to extract the groupLabel, or is there a better way?

Thanks

Full test code below:

string strTestHTML = "<div class=\"content\" data-id=\"123456789\">" +
                              "  <div class=\"m-group item\">" +
                              "      <span class=\"group\">" +
                              "          <a href=\"javascript:void(0);\">" +
                              "          <span class=\"group-label\">Newspaper </span>" +
                              "          <span class=\"group-value\">82548828</span>" +
                              "          </a>" +
                              "      </span>" +
                              "      <span class=\"group\">" +
                              "          <a href=\"javascript:void(0);\">" +
                              "          <span class=\"group-label\">Fish </span>" +
                              "          <span class=\"group-value\">8545852485</span>" +
                              "          </a>" +
                              "      </span>" +
                              "  </div>" +
                              "</div>";


            //<div class="content" data-id="123456789">
            string sNewXpath = "//div[contains(@class,'content') and contains(@data-id, '" + "123456789" + "')]";
            //<div class="m-group item">
            string sSecondXPath = "/div[contains(@class,'m-group item')]";
            //<span class="group"
            string sThirdXPath = "//span[contains(@class,'group')]";

            string sXPathT = "//span[contains(@class,'group-label')]";
            string sXPathO = "//span[contains(@class,'group-value')]";

            HtmlAgilityPack.HtmlDocument Doc = new HtmlDocument();
            Doc.LoadHtml(strTestHTML);

            foreach (HtmlNode div in Doc.DocumentNode.SelectNodes(sNewXpath + sSecondXPath))
            {
                foreach (HtmlNode div2 in div.SelectNodes(sThirdXPath))
                {
                    var vOddL = div2.SelectSingleNode(sXPathT);
                    var vOddP = div2.SelectSingleNode(sXPathO);

                    string GroupLabel = vOddL.InnerText.Trim();

                    string GroupValue = vOddP.InnerText.Trim();
                }
            }

EDIT:

Worked out why I was getting 6 items back in the for loop.

sThirdXPath was: string sThirdXPath = "//span[contains(@class,'group')]";

It should be:

string sThirdXPath = "//span[@class='group']";

Still trying to find the right way to interrogate the HtmlNode contained in div2 to find the values of interest. I assume it needs an XPath that matches inside the current node, not across the whole HTML document.

Updated HTML sample:

<div class="content" data-id="123456789">
<div class="m-group item">
    <span class="group">
        <a href="javascript:void(0);">
        <span class="group-label">Newspaper </span>
        <span class="group-value">82548828</span>
        </a>
    </span>

    <span class="group">
        <a href="javascript:void(0);">
        <span class="group-label">Fish </span>
        <span class="group-value">8545852485</span>
        </a>
    </span>
</div>
</div>

<div class="content" data-id="987654321">
<div class="m-group item">
    <span class="group">
        <a href="javascript:void(0);">
        <span class="group-label">Bread</span>
        <span class="group-value">82548828</span>
        </a>
    </span>

    <span class="group">
        <a href="javascript:void(0);">
        <span class="group-label">Milk </span>
        <span class="group-value">8545852485</span>
        </a>
    </span>
</div>
</div>

In the above example, what is the correct XPath to access just Bread and its value and Milk and its value? I assume I need to filter on data-id="987654321" in the XPath?

Your suspicion is correct. An XPath that starts with "//" is evaluated from the document root even when you call it on a child node, which is why div2.SelectSingleNode always returned the first match in the document (Newspaper). Because your queries already search the full document, you don't need a loop. To get the "Newspaper" and "Fish" nodes in this example, you can simply use SelectNodes instead of looping and calling SelectSingleNode. If there are more items you can of course loop through the result set; I accessed them by index in this example as there are only two of them.

// Needs: using System; using System.Linq; using HtmlAgilityPack;
// "//" searches the whole document, so each query returns every match.
string sXPathT = "//span[contains(@class,'group-label')]";
string sXPathO = "//span[contains(@class,'group-value')]";

HtmlAgilityPack.HtmlDocument Doc = new HtmlDocument();
Doc.LoadHtml(strTestHTML);

// SelectNodes returns all matching nodes (or null if nothing matches).
var vOddL = Doc.DocumentNode.SelectNodes(sXPathT);
var vOddP = Doc.DocumentNode.SelectNodes(sXPathO);

// Labels and values come back in document order, so index i of one
// collection corresponds to index i of the other.
string GroupLabelNewsPaper = vOddL.ElementAt(0).InnerText.Trim();
string GroupLabelFish = vOddL.ElementAt(1).InnerText.Trim();

string GroupValueNewspaper = vOddP.ElementAt(0).InnerText.Trim();
string GroupValueFish = vOddP.ElementAt(1).InnerText.Trim();

Console.WriteLine($"{GroupLabelNewsPaper}\t{GroupValueNewspaper}");
Console.WriteLine($"{GroupLabelFish}\t{GroupValueFish}");

Output:

Newspaper       82548828
Fish    8545852485
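
If the number of items is not known in advance, the index-based access above can be replaced by a loop over both result sets. The following is only a sketch under some assumptions: the class name GroupReader and the method ReadPairs are made up for illustration, and it assumes every label has a value at the same position in document order.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class GroupReader
{
    // Pairs each group-label with the group-value at the same position.
    public static List<KeyValuePair<string, string>> ReadPairs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var labels = doc.DocumentNode.SelectNodes("//span[@class='group-label']");
        var values = doc.DocumentNode.SelectNodes("//span[@class='group-value']");

        var pairs = new List<KeyValuePair<string, string>>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (labels == null || values == null)
        {
            return pairs;
        }

        // Assumption: labels and values line up one-to-one in document order.
        int count = Math.Min(labels.Count, values.Count);
        for (int i = 0; i < count; i++)
        {
            pairs.Add(new KeyValuePair<string, string>(
                labels[i].InnerText.Trim(),
                values[i].InnerText.Trim()));
        }
        return pairs;
    }
}
```

Calling GroupReader.ReadPairs(strTestHTML) on the test string from the question should give two pairs, Newspaper with 82548828 and Fish with 8545852485. The null check matters because calling a method on the result of a SelectNodes call that matched nothing would otherwise throw a NullReferenceException.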

UPDATE: To get a specific content node you can use this XPath:

string xpathForDataId = "//div[@class='content' and @data-id='987654321']";

You can filter the divs with the above expression and then get its child nodes like this:

string sXPathT = ".//span[contains(@class,'group-label')]";
string sXPathO = ".//span[contains(@class,'group-value')]";
string xpathForDataId = "//div[@class='content' and @data-id='987654321']";

HtmlAgilityPack.HtmlDocument Doc = new HtmlDocument();
Doc.LoadHtml(strTestHTML);

var specificNode = Doc.DocumentNode.SelectSingleNode(xpathForDataId);

var vOddL = specificNode.SelectNodes(sXPathT);
var vOddP = specificNode.SelectNodes(sXPathO);

string GroupLabelBread = vOddL.ElementAt(0).InnerText.Trim();
string GroupLabelMilk = vOddL.ElementAt(1).InnerText.Trim();

string GroupValueBread = vOddP.ElementAt(0).InnerText.Trim();
string GroupValueMilk = vOddP.ElementAt(1).InnerText.Trim();

Console.WriteLine($"{GroupLabelBread}\t{GroupValueBread}");
Console.WriteLine($"{GroupLabelMilk}\t{GroupValueMilk}");

Notice the ".//" in sXPathT and sXPathO. With that prefix we search only the current context node and not the whole document.

Output:

Bread   82548828
Milk    8545852485
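
The same relative ".//" lookup also works inside the nested loop from the question, so each label is read together with the value from its own group instead of relying on matching indexes. This is a sketch only: ContentReader and ReadGroups are hypothetical names, and building the XPath by string concatenation assumes that dataId contains no quote characters.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class ContentReader
{
    // Returns one "label<TAB>value" line per group inside the content div
    // whose data-id attribute equals dataId.
    public static List<string> ReadGroups(string html, string dataId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // Assumption: dataId has no quotes, so it is safe to embed in the XPath.
        string groupXPath =
            "//div[@class='content' and @data-id='" + dataId + "']" +
            "//span[@class='group']";

        var groups = doc.DocumentNode.SelectNodes(groupXPath);
        if (groups == null)
        {
            return result; // no matching content div or no groups in it
        }

        foreach (HtmlNode group in groups)
        {
            // ".//" keeps the search inside the current group node.
            HtmlNode label = group.SelectSingleNode(".//span[@class='group-label']");
            HtmlNode value = group.SelectSingleNode(".//span[@class='group-value']");
            if (label == null || value == null)
            {
                continue; // skip incomplete groups
            }
            result.Add(label.InnerText.Trim() + "\t" + value.InnerText.Trim());
        }
        return result;
    }
}
```

With the updated HTML sample, ContentReader.ReadGroups(strTestHTML, "987654321") should return the Bread and Milk lines only, because the outer XPath restricts the search to the matching content div and the inner queries never leave the current group.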
