
How to extract different divs between two anchor tags in HtmlAgilityPack?

<html>
<A NAME="doc_id_1"></A>

<div class="find1">
Iam here, extract me.
</div>
<div class = "find2">

iam here also, extract me as well.
</div>

<A NAME="doc_id_2"></A>

</html>

I have used the code below to extract the data:

    var nodes = doc.DocumentNode.SelectNodes("//a[@name = 'doc_id_1']");
    var nodes1 = doc.DocumentNode.SelectNodes("//a[@name = 'doc_id_2']");

    foreach (HtmlNode node in nodes)
    {
        string yourText1 = node.InnerText;
        //var yourText2 = node.NextSibling.SelectNodes("//div");
        string yourText2 = node.NextSibling.InnerHtml;

        //foreach (HtmlNode var in yourText2)
        //{
        //    string yourText3 = var.InnerHtml;
        //}

    }

I don't want to rely on the class names of those divs, because I am writing generic code. Any help will be appreciated.

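I'm assuming you will know the name values of the two anchor tags.

var doc = new HtmlDocument();
doc.LoadHtml(html); // assumption: html is a string holding the markup from the question

var firstAnchor = doc.DocumentNode.SelectSingleNode("//a[@name = 'doc_id_1']");

var div = firstAnchor.NextSibling;

// Stop at the second anchor, or when there are no more siblings.
// Name is the tag name ("a"), so the anchor is matched by its name attribute instead.
while (div != null && div.GetAttributeValue("name", "") != "doc_id_2")
{
    if (div.Name == "div") // skip the whitespace text nodes between the elements
    {
        var divText = div.InnerText; //do whatever with this
    }
    div = div.NextSibling;
}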

One option, using LINQ:

var doc = new HtmlDocument();
doc.LoadHtml(html: Resources.Html);

var startNode = doc.DocumentNode.SelectSingleNode("//a[@name = 'doc_id_1']");
var endNode = doc.DocumentNode.SelectSingleNode("//a[@name = 'doc_id_2']");

var parent = startNode.ParentNode;

var nodesYouWant = parent.ChildNodes
    .SkipWhile(node => node != startNode)   // skip all nodes up to the start node
    .Skip(1)                                // skip the start node
    .TakeWhile(node => node != endNode)     // take all nodes up to the next anchor
    .Where(node => node.Name == "div");     // select only div nodes

Or:

var currentNode = doc.DocumentNode.SelectSingleNode("//a[@name = 'doc_id_1']");
var endNode = doc.DocumentNode.SelectSingleNode("//a[@name = 'doc_id_2']");

var nodesYouWant = GetEnclosedNodes(currentNode, endNode).Where(node => node.Name == "div");

private static IEnumerable<HtmlNode> GetEnclosedNodes(HtmlNode currentNode, HtmlNode endNode)
{
    currentNode = currentNode.NextSibling;

    while (currentNode != null && currentNode != endNode)
    {
        yield return currentNode;

        currentNode = currentNode.NextSibling;
    }
}
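
For reference, here is how the sibling-walking approach might look as a complete console program. This is only a sketch: the class `AnchorDivExtractor` and the method `ExtractDivTexts` are hypothetical names, not part of HtmlAgilityPack. It also assumes that both anchors are siblings under the same parent, as in the question's markup, and that the anchor names contain no quote characters, since they are interpolated directly into the XPath expression.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class AnchorDivExtractor
{
    // Hypothetical helper: returns the trimmed text of every <div> that is a
    // sibling of the start anchor and appears before the end anchor.
    public static List<string> ExtractDivTexts(string html, string startName, string endName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var start = doc.DocumentNode.SelectSingleNode($"//a[@name = '{startName}']");
        var end = doc.DocumentNode.SelectSingleNode($"//a[@name = '{endName}']");

        var result = new List<string>();
        if (start == null)
        {
            return result; // start anchor missing: nothing to extract
        }

        // Walk the siblings after the start anchor. The loop stops at the end
        // anchor, or at the last sibling if the end anchor is missing or is
        // not a sibling of the start anchor.
        for (var node = start.NextSibling; node != null && node != end; node = node.NextSibling)
        {
            if (node.Name == "div") // skips whitespace text nodes and other elements
            {
                result.Add(node.InnerText.Trim());
            }
        }

        return result;
    }

    public static void Main()
    {
        const string html = @"<html>
<A NAME=""doc_id_1""></A>
<div class=""find1"">
Iam here, extract me.
</div>
<div class = ""find2"">
iam here also, extract me as well.
</div>
<A NAME=""doc_id_2""></A>
</html>";

        foreach (var text in ExtractDivTexts(html, "doc_id_1", "doc_id_2"))
        {
            Console.WriteLine(text);
        }
    }
}
```

Because the filter looks only at the tag name, no class names are needed, which keeps the code generic. If the anchors can sit at different nesting levels, this sibling walk will not reach the end anchor, and a different traversal would be needed.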
