
Get links containing specific words from HTML in C#

I am trying to parse a website. I need the links in an HTML file whose URLs contain certain words. I know how to find "href" attributes, but I don't need all of them. Is there any way to do that? For example, can I use a regex with HtmlAgilityPack?

HtmlNode links = document.DocumentNode.SelectSingleNode("//*[@id='navigation']/div/ul");

foreach (HtmlNode urls in document.DocumentNode.SelectNodes("//a[@href]"))
{
    this.dgvurl.Rows.Add(urls.Attributes["href"].Value);
}

This is what I use to find all the links in the HTML.

If you have an HTML file like this:

<div class="a">
    <a href="http://www.website.com/"></a>
    <a href="http://www.website.com/notfound"></a>
    <a href="http://www.website.com/theword"></a>
    <a href="http://www.website.com/sub/theword"></a>
    <a href="http://www.website.com/theword.html"></a>
    <a href="http://www.website.com/other"></a>
</div>

And you're searching for, say, the words theword and other, you can define a regular expression and then use LINQ to keep only the links whose href attribute matches it:

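// Requires: using System.Collections.Generic; using System.Linq;
//           using System.Text.RegularExpressions; using HtmlAgilityPack;
Regex regex = new Regex("(theword|other)", RegexOptions.IgnoreCase);

HtmlNode node = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='a']");

// Keep only the anchors whose href matches the regular expression.
// GetAttributeValue returns "" instead of throwing when an anchor has no href.
List<HtmlNode> nodeList = node.SelectNodes(".//a")
    .Where(a => regex.IsMatch(a.GetAttributeValue("href", "")))
    .ToList();

List<string> urls = new List<string>();

foreach (HtmlNode n in nodeList)
{
    urls.Add(n.GetAttributeValue("href", ""));
}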
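
For reference, the same regex-plus-LINQ idea can be put together as a small self-contained program. This is only a sketch: the LinkFilter class and FindLinks method are made-up names for illustration, the HTML is a shortened copy of the sample above, and it assumes the HtmlAgilityPack NuGet package is referenced. The words are passed through Regex.Escape so that they are matched literally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class LinkFilter
{
    // Hypothetical helper: returns the href of every <a> whose URL contains one of the words.
    public static List<string> FindLinks(string html, IEnumerable<string> words)
    {
        // Escape each word so characters such as '.' or '?' are matched literally.
        string pattern = string.Join("|", words.Select(Regex.Escape));
        var regex = new Regex(pattern, RegexOptions.IgnoreCase);

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return new List<string>();
        }

        return anchors
            .Select(a => a.GetAttributeValue("href", ""))
            .Where(href => regex.IsMatch(href))
            .ToList();
    }

    public static void Main()
    {
        string html = @"<div class=""a"">
    <a href=""http://www.website.com/notfound""></a>
    <a href=""http://www.website.com/sub/theword""></a>
    <a href=""http://www.website.com/other""></a>
</div>";

        foreach (string url in FindLinks(html, new[] { "theword", "other" }))
        {
            Console.WriteLine(url);
        }
    }
}
```

One thing to watch: an empty word list produces an empty pattern, which matches every link, so you may want to check for that case before calling the helper.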

Note that XPath also has a contains() function, but you have to repeat the condition for each word you're searching for, like this:

node.SelectNodes(".//a[contains(@href,'theword') or contains(@href,'other')]")
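
If the word list is long or only known at run time, the repeated conditions can be generated instead of typed by hand. The following is a minimal sketch, not part of HtmlAgilityPack: XPathBuilder and BuildContainsXPath are hypothetical names, and it assumes the words contain no single quotes, because XPath 1.0 string literals cannot escape them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class XPathBuilder
{
    // Hypothetical helper: builds ".//a[contains(@href,'w1') or contains(@href,'w2')]".
    public static string BuildContainsXPath(IEnumerable<string> words)
    {
        List<string> conditions = words
            .Select(w =>
            {
                // XPath 1.0 string literals have no escape sequence, so a word
                // containing a single quote cannot be embedded this way.
                if (w.Contains("'"))
                {
                    throw new ArgumentException("Word must not contain a single quote: " + w);
                }
                return "contains(@href,'" + w + "')";
            })
            .ToList();

        if (conditions.Count == 0)
        {
            throw new ArgumentException("At least one word is required.", nameof(words));
        }

        return ".//a[" + string.Join(" or ", conditions) + "]";
    }
}
```

You would then call node.SelectNodes(XPathBuilder.BuildContainsXPath(new[] { "theword", "other" })). Keep in mind that contains() is case-sensitive, unlike the regex above, which uses RegexOptions.IgnoreCase.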

There's also a matches() function in XPath. Unfortunately, it is only available in XPath 2.0, and HtmlAgilityPack uses XPath 1.0. With XPath 2.0, you could do something like this:

node.SelectNodes(".//a[matches(@href,'(theword|other)')]")

I found this, and it works for me.

HtmlNode links = document.DocumentNode.SelectSingleNode("//*[@id='navigation']/div/ul");
foreach (HtmlNode urls in document.DocumentNode.SelectNodes("//a[@href]"))
{
    // Keep only the links whose URL contains the word.
    var temp = urls.Attributes["href"].Value;
    if (temp.Contains("some_word"))
    {
        dgv.Rows.Add(temp);
    }
}


 