Get links with specific words from HTML code in C#

I am trying to parse a website. I need the links in the HTML file that contain some specific words. I know how to find the "href" attributes, but I don't need all of them. Is there any way to do that? For example, can I use a regex with HtmlAgilityPack?

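// Container node for the navigation list (selected here, but not used by the loop below).
HtmlNode links = document.DocumentNode.SelectSingleNode("//*[@id='navigation']/div/ul");

// Select every <a> element that has an href attribute.
foreach (HtmlNode urls in document.DocumentNode.SelectNodes("//a[@href]"))
{
    this.dgvurl.Rows.Add(urls.Attributes["href"].Value);
}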

This is what I'm using to find all the links in the HTML code.

If you have an HTML file like this:

<div class="a">
    <a href="http://www.website.com/"></a>
    <a href="http://www.website.com/notfound"></a>
    <a href="http://www.website.com/theword"></a>
    <a href="http://www.website.com/sub/theword"></a>
    <a href="http://www.website.com/theword.html"></a>
    <a href="http://www.website.com/other"></a>
</div>

Suppose you are searching for the words theword and other. You can define a regular expression, then use LINQ to select the links whose href attribute matches it, like this:

Regex regex = new Regex("(theword|other)", RegexOptions.IgnoreCase);

HtmlNode node = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='a']");
// Requires "using System.Linq;". The [@href] predicate skips anchors without an href,
// so a.Attributes["href"] is never null inside the filter.
List<HtmlNode> nodeList = node.SelectNodes(".//a[@href]").Where(a => regex.IsMatch(a.Attributes["href"].Value)).ToList();

List<string> urls = new List<string>();

foreach (HtmlNode n in nodeList)
{
    urls.Add(n.Attributes["href"].Value);
}
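The same idea can be wrapped into a small helper. This is only a sketch under some assumptions: the class name LinkFilter and the method GetMatchingLinks are hypothetical, not part of HtmlAgilityPack, and the words are escaped with Regex.Escape on the assumption that they should be matched literally. It also guards against SelectNodes returning null, which HtmlAgilityPack does when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class LinkFilter
{
    // Returns the href values found under containerXPath that contain any of the words.
    public static List<string> GetMatchingLinks(string html, string containerXPath, IEnumerable<string> words)
    {
        List<string> wordList = words.ToList();
        if (wordList.Count == 0)
        {
            // An empty pattern would match every link, so return nothing instead.
            return new List<string>();
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Escape each word so characters such as '.' are matched literally.
        string pattern = string.Join("|", wordList.Select(w => Regex.Escape(w)));
        var regex = new Regex(pattern, RegexOptions.IgnoreCase);

        HtmlNode container = doc.DocumentNode.SelectSingleNode(containerXPath);
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = container?.SelectNodes(".//a[@href]");
        if (anchors == null)
        {
            return new List<string>();
        }

        return anchors
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => regex.IsMatch(href))
            .ToList();
    }
}
```

For the sample HTML above, a call such as LinkFilter.GetMatchingLinks(html, "//div[@class='a']", new[] { "theword", "other" }) should keep only the links whose href contains one of the two words.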

Note that XPath has a contains() function, but you have to repeat the condition for each word you are searching for, like this:

node.SelectNodes(".//a[contains(@href,'theword') or contains(@href,'other')]")
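If the list of words is not fixed, the repeated conditions could be generated rather than typed by hand. The following is a minimal sketch, assuming the words contain no single quotes (XPath 1.0 string literals have no escape mechanism); XPathBuilder and BuildContainsXPath are hypothetical names introduced only for illustration. Keep in mind that contains() is case-sensitive, unlike the regex with RegexOptions.IgnoreCase.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class XPathBuilder
{
    // Builds ".//a[contains(@href,'w1') or contains(@href,'w2') ...]" from the given words.
    public static string BuildContainsXPath(IEnumerable<string> words)
    {
        List<string> wordList = words.ToList();
        if (wordList.Count == 0)
        {
            throw new ArgumentException("At least one word is required.", nameof(words));
        }
        if (wordList.Any(w => w.Contains("'")))
        {
            // Assumption of this sketch: words with single quotes are rejected, not escaped.
            throw new ArgumentException("Words containing a single quote are not supported.", nameof(words));
        }

        IEnumerable<string> conditions = wordList.Select(w => $"contains(@href,'{w}')");
        return ".//a[" + string.Join(" or ", conditions) + "]";
    }
}
```

The result can then be passed to SelectNodes, for example node.SelectNodes(XPathBuilder.BuildContainsXPath(new[] { "theword", "other" })).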

XPath also has a matches() function, but unfortunately it is only available in XPath 2.0, and HtmlAgilityPack uses XPath 1.0. With XPath 2.0, you could do something like this:

node.SelectNodes(".//a[matches(@href,'(theword|other)')]")

I found this, and it works for me.

HtmlNode links = document.DocumentNode.SelectSingleNode("//*[@id='navigation']/div/ul");
foreach (HtmlNode urls in document.DocumentNode.SelectNodes("//a[@href]"))
{
    // Keep only the links whose href contains the word being searched for.
    var temp = urls.Attributes["href"].Value;
    if (temp.Contains("some_word"))
    {
        dgv.Rows.Add(temp);
    }
}
