
Find elements between two anchors, including the anchor's parent, with XPath

I am trying to parse an autogenerated HTML file. It comes from a HAT, and I have no influence over the generated HTML.

<!DOCTYPE html>
<html lang="de">
    <head>
      <!-- Header bla bla -->
    </head>

    <body class="md-nav-expanded">
      <!-- Some HTML-Elements, that doesn't matter -->

      <div id="main">
        <article>
            <div id="topic-content" class="container-fluid">
                <!-- Uninteresting div -->

                <a id="main-content"></a>

                <h2>Steuerelemente</h2>

                <div class="main-content">
                    
                    <h6 class="rvps5"><a name="MyAnchor1"></a><span class="rvts0"><span class="rvts13">Title 1</span></span></h6>
                    <p class="rvps3"><span class="rvts8">Some text for Title 1 with inner HTML elements.</span></p>
                    <h6 class="rvps5"><a name="MyAnchor2"></a><span class="rvts0"><span class="rvts13">Title 2</span></span></h6>
                    <p class="rvps3"><span class="rvts8">Some text for Title 2.</span></p>
                    <p class="rvps3"><span class="rvts8">Some more text for Title 2.</span></p>
                    <h6 class="rvps5"><a name="MyAnchor3"></a><span class="rvts0"><span class="rvts13">Title 3</span></span></h6>
                    <p class="rvps3"><span class="rvts8">Some text for Title 3</span></p>
                    <p class="rvps3"><span class="rvts8"><br/></span></p>
                    <p class="rvps2" style="clear: both;">
                        <span class="rvts6">Autogenerated Text</span>
                        <!-- This anchor should be ignored, because it has no name attribute -->
                        <a class="rvts7" href="https://www.anywhere.com">Anywhere</a>
                    </p>
                </div>
                <!-- The rest of the HTML doesn't matter -->
            </div>  <!-- /#topic-content -->
        </article>
      </div>  <!-- /#main -->
    </body>
</html>

I am trying to extract the HTML from MyAnchor1 (including its parent h6, which could be any other element) up to MyAnchor2, then from MyAnchor2 up to MyAnchor3, and finally from MyAnchor3 to the end.

First, I load the file into an HtmlDocument:

htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.Load(refFile);

Then I find the div 'main-content':

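// SelectSingleNode returns null when nothing matches, whereas
// SelectNodes(...).FirstOrDefault() throws because SelectNodes itself returns null.
var mainContentDiv = htmlDoc.DocumentNode.SelectSingleNode("//div[contains(@class, 'main-content')]");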

Now I am stuck on how to get the HTML between the anchors. I tried Substring, but the positions in the nodes (StartIndex and InnerLength) do not seem to match the string values.

Another approach was to get the anchors themselves, but then I don't know how to get the elements up to the next anchor (or the end).

One approach that doesn't work:

var anchors = mainContentDiv.SelectNodes(".//a[@name]");
if (anchors != null)
{
    foreach (var anchor in anchors)
    {
        var anchorName = anchor.GetAttributeValue<string>("name", null);
        var followingNodes = mainContentDiv.SelectNodes(".//*[preceding::a and following::a[@name = '" + anchorName + "']]");
    }
}

Can anyone please help me? Thanks.

Update:

I want to get 3 HTML parts: 1.

<h6 class="rvps5"><a name="MyAnchor1"></a><span class="rvts0"><span class="rvts13">Title 1</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 1 with inner HTML elements.</span></p>

2.

<h6 class="rvps5"><a name="MyAnchor2"></a><span class="rvts0"><span class="rvts13">Title 2</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 2.</span></p>
<p class="rvps3"><span class="rvts8">Some more text for Title 2.</span></p>

and 3.

<h6 class="rvps5"><a name="MyAnchor3"></a><span class="rvts0"><span class="rvts13">Title 3</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 3</span></p>
<p class="rvps3"><span class="rvts8"><br/></span></p>
<p class="rvps2" style="clear: both;">
    <span class="rvts6">Autogenerated Text</span>
    <!-- This anchor should be ignored, because it has no name attribute -->
    <a class="rvts7" href="https://www.anywhere.com">Anywhere</a>
</p>

Working solution: I finally have a working solution that takes the unclear structure of the generated HTML into account.

// SelectSingleNode returns null when nothing matches, whereas
// SelectNodes(...).FirstOrDefault() throws because SelectNodes itself returns null.
var mainContentDiv = htmlDoc.DocumentNode.SelectSingleNode("//div[contains(@class, 'main-content')]");

var snippets = new Dictionary<string, string>();
snippets.Add("", mainContentDiv.InnerHtml);

var anchors = mainContentDiv.SelectNodes(".//a[@name]");
if (anchors != null)
{
    foreach (var anchor in anchors)
    {
        var sb = new StringBuilder();

        var anchorName = anchor.GetAttributeValue<string>("name", null);
        var node = anchor;
        while (node.ParentNode.GetAttributeValue<string>("class", null) != "main-content" && node.ParentNode.SelectNodes(".//a[@name]").Count == 1)
        {
            node = node.ParentNode;
        }

        sb.Append(node.OuterHtml);
        while (node.NextSibling != null)
        {
            var nodeCollection = node.NextSibling.SelectNodes(".//a[@name]");
            if (nodeCollection != null)
                break;

            node = node.NextSibling;
            sb.Append(node.OuterHtml);
        }

        snippets.Add(anchorName, sb.ToString());
    }
}

htmlSnippes.Add(helpContextId, snippets);

Thanks all for helping.
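
Since the title asks about XPath: the expression tried earlier keeps nodes that have some anchor before them and the current anchor after them, which is roughly the opposite of the wanted range. A possible XPath 1.0 formulation is to give every direct child of the container a group number, namely how many children at or before it contain a named anchor, and then select one group at a time. The following is only a sketch under assumptions: AnchorGroups and GroupHtml are hypothetical names, it relies on HtmlAgilityPack evaluating count() and the preceding-sibling axis as standard XPath 1.0, and unlike the sibling loop it only returns element nodes, so whitespace text and comments between the blocks are dropped.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class AnchorGroups
{
    // Returns one HTML string per top-level block that contains a named anchor.
    public static List<string> GroupHtml(HtmlNode container)
    {
        var result = new List<string>();

        // Direct children that are, or contain, an <a name="...">.
        var starts = container.SelectNodes("./*[descendant-or-self::a[@name]]");
        int groupCount = starts == null ? 0 : starts.Count; // null means no match

        for (int k = 1; k <= groupCount; k++)
        {
            // Group number of a child = anchor blocks before it
            // plus 1 if the child itself is an anchor block.
            string xpath =
                "./*[count(preceding-sibling::*[descendant-or-self::a[@name]])"
                + " + count(self::*[descendant-or-self::a[@name]]) = " + k + "]";

            var group = container.SelectNodes(xpath);
            if (group != null)
                result.Add(string.Concat(group.Select(n => n.OuterHtml)));
        }

        return result;
    }
}
```

Children that come before the first anchor block get group number 0 and are therefore not part of any result. Calling AnchorGroups.GroupHtml(mainContentDiv) on the sample page should yield three strings, one per title.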

You can try the following code:

List<string> htmlParts = new List<string>();
var anchors = mainContentDiv.SelectNodes(".//a[@name]");
if (anchors != null)
{
    foreach (var anchor in anchors)
    {
        var node = anchor.ParentNode;

        StringBuilder sb = new StringBuilder(node.OuterHtml);

        while ((node = node.NextSibling) != null)
        {
            if (node.SelectSingleNode(".//a[@name]") != null)
                break;
            else
                sb.Append(node.OuterHtml);
        }

        htmlParts.Add(sb.ToString());
    }
}

The code assumes that each anchor element always has a parent element that is a direct child of mainContentDiv. You will have to adjust it if this is not always true.
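
If an anchor can be nested more deeply than in the sample, one possible adjustment is to climb from the anchor until the node is a direct child of the container, and only then walk the siblings. This is a minimal sketch, not a definitive implementation: AnchorSplitter and SplitByNamedAnchors are hypothetical names, and the sketch assumes that each top-level block holds at most one named anchor (with two anchors in the same block, both entries would start at that block).

```csharp
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class AnchorSplitter
{
    // Maps each anchor name to the HTML from the anchor's top-level block
    // up to (but not including) the next block that holds a named anchor.
    public static Dictionary<string, string> SplitByNamedAnchors(HtmlNode container)
    {
        var parts = new Dictionary<string, string>();

        var anchors = container.SelectNodes(".//a[@name]");
        if (anchors == null)
            return parts; // SelectNodes returns null when nothing matches

        foreach (var anchor in anchors)
        {
            // Climb until the node is a direct child of the container.
            var node = anchor;
            while (node.ParentNode != null && node.ParentNode != container)
                node = node.ParentNode;

            var sb = new StringBuilder(node.OuterHtml);
            for (var next = node.NextSibling; next != null; next = next.NextSibling)
            {
                // Stop at the next block that is, or contains, a named anchor.
                if (next.SelectSingleNode("descendant-or-self::a[@name]") != null)
                    break;
                sb.Append(next.OuterHtml);
            }

            // The indexer overwrites on duplicate names instead of throwing.
            parts[anchor.GetAttributeValue("name", "")] = sb.ToString();
        }

        return parts;
    }
}
```

A call such as AnchorSplitter.SplitByNamedAnchors(mainContentDiv) would then return one entry per anchor name. Whitespace text nodes and comments between the blocks are kept in the output, because NextSibling visits every node type.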
