简体   繁体   中英

XPath/HtmlAgilityPack: How to find an element (a) with a specific value for an attribute (href) and find adjacent table columns?

I'm pretty desperate because I can't figure out how to achieve what I stated in the question. I've already read countless of similar examples but didn't find one which works in exact situation. So, let's say I have the following code:

<table><tr>
<td><a href="url-a">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>
<td><a href="url-b">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>
<td><a href="url-c">text C</a></td><td><a>id C</a></td><td><a>img C</a></td>
</tr></table>

Now, what I already have is a part of url-a. I basically want to know how I can get id A and img A. I'm trying to "find" the line with XPath but I can't work out a way to make it work. Also, it might be possible that the information is not present at all. This is my latest try (seriously, I've tinkered with this for more than 3 hours now trying numerous different ways):

if (htmlDoc.DocumentNode.SelectSingleNode(@"/a[contains(@href, 'part-url-a')]") != null)
    string ida = htmlDoc.DocumentNode.SelectSingleNode(@"/a[contains(@href, 'part-url-a')]/following-sibling::a").InnerText;

Well, it's apparently wrong as hell so I'd be very pleased if someone could help me out here. Also I'd appreciate it if someone could point me to some Website which explains XPath and the notations/Syntax in detail with examples like this one. Books also welcome.

PS: I know I could achieve my goal without XPath at all too with Regex or just a simple StreamReader in C# and checking if each line contains what I need but a) it's too fragile for my needs because the code might have abrupt line-breaks and b) I really want to stay consistend with sticking completely to XPath for anything I'm doing in this project.

Thanks in advance for your help!

Use the following XPath expressions :

   /*/tr/td[a[@href='url-a']]
                /following-sibling::td[1]
                     /a/text()

When evaluated against the provided (malformed but corrected) XML document :

<table><tr>
<td><a href="url-a">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>
<td><a href="url-b">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>
<td><a href="url-c">text C</a></td><td><a>id C</a></td><td><a>img C</a></td>
</tr></table>

the wanted text node is selected :

id A

Similarly, this XPath expression :

   /*/tr/td[a[@href='url-a']]
                /following-sibling::td[2]
                     /a/text()

when evaluated against the same XML document (above), selects the other wanted text node :

img A

XSLT-based verification :

When this transformation is applied on the XML document (above):

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "/*/tr/td[a[@href='url-a']]
                /following-sibling::td[1]
                     /a/text()"/>

  <xsl:text>&#10;</xsl:text>
  <xsl:copy-of select=
   "/*/tr/td[a[@href='url-a']]
                /following-sibling::td[2]
                     /a/text()"/>
 </xsl:template>
</xsl:stylesheet>

the wanted results are produced :

id A
img A

You have a seriously broken HTML with unmatching closing td tags. Fix them please. It's just an ugly picture this markup.

This being said hopefully Html Agility Pack can handle any crap that you throw at it, so here's how to proceed and parse the junk you have and find the id and img values given the href :

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.Load("test.html");
        var anchor = doc.DocumentNode.SelectSingleNode("//a[contains(@href, 'url-a')]");
        if (anchor != null)
        {
            var id = anchor.ParentNode.SelectSingleNode("following-sibling::td/a");
            if (id != null)
            {
                Console.WriteLine(id.InnerHtml);
                var img = id.ParentNode.SelectSingleNode("following-sibling::td/a");
                if (img != null)
                {
                    Console.WriteLine(img.InnerHtml);
                }
            }
        }
    }
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM