How to extract a form tag using HtmlAgilityPack?

I'm using HtmlAgilityPack in one of my C# projects for scraping. I need to scrape the <form> tag from a web page. I've searched for how to extract a form tag using HtmlAgilityPack but couldn't find an answer. Can anyone tell me how to extract the <form> tag using HtmlAgilityPack?

private void Testing()
{
    var getHtmlWeb = new HtmlWeb();
    var document = getHtmlWeb.Load(@"http://localhost/final_project/index.php");
    HtmlNode.ElementsFlags.Remove("form");
    var aTags = document.DocumentNode.SelectNodes("//form");
    int counter = 1;
    StringBuilder buffer = new StringBuilder();
    if (aTags != null)
    {
        foreach (var aTag in aTags)
        {
            buffer.Append(counter + ". " + aTag.InnerHtml + " - " + "\t" + "<br />");
            counter++;
        }
    }
}

Here is my code sample. I'm scraping a page from my localhost. The count of aTags is 1 because there is only one form on the page, but my StringBuilder object doesn't contain any InnerHtml of the form. Where is the error?

Here is the HTML source from which I want to scrape the form:

<!DOCTYPE html>
<html>
    <head>
    <!-- stylesheet section -->
    <link rel="stylesheet" type="text/css" media="all" href="./_include/style.css">

    <!-- title of the page -->
    <title>Login</title>

    <!-- PHP Section -->
    <!-- Creating a connection with database-->
     <!-- end of PHP Section -->

    </head>
        <body>
            <!-- now we'll check error variable to print warning -->
                        <!-- we'll submit the data to the same page to avoid excessive pages -->
            <form action="/final_project/index.php" method="post">
                <!-- ============================== Fieldset 1 ============================== -->
                <fieldset>
                    <legend>Log in credentials:</legend>
                    <hr class="hrzntlrow" />
                        <label for="input-one"><strong>User Name:</strong></label><br />
                        <input autofocus name="userName" type="text" size="20" id="input-one" class="text" placeholder="User Name" required /><br />

                        <label for="input-two"><strong>Password:</strong></label><br />
                        <input name="password" type="password" size="20" id="input-two" class="text" placeholder="Password" required />
                </fieldset>
                <!-- ============================== Fieldset 1 end ============================== -->

                <p><input type="submit" alt="SUBMIT" name="submit" value="SUBMIT" class="submit-text" /></p>
            </form>
        </body>
</html>

Since form tags are allowed to overlap, HAP handles them differently from other elements. To have form tags treated like any other element, remove the form flag by calling:

HtmlAgilityPack.HtmlNode.ElementsFlags.Remove("form");

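Now your form tags will be handled as you expect, and you can work with them the same way you work with other tags. Note that the flag is consulted while the document is being parsed, so the call has to come before the document is loaded. In the question's code it comes after getHtmlWeb.Load, so the already-parsed form node still has no children and its InnerHtml is empty.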
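As a minimal sketch of that ordering, the snippet below removes the form flag first and only then parses the markup. The class name FormExtractor and the method ExtractForms are hypothetical names chosen for illustration, and it parses an HTML string with HtmlDocument.LoadHtml instead of fetching the localhost page, so it can be tried without a web server. It assumes the HtmlAgilityPack package is referenced.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class FormExtractor
{
    // Returns the InnerHtml of every <form> element found in the given markup.
    public static List<string> ExtractForms(string html)
    {
        // Remove the special handling of <form> BEFORE parsing.
        // ElementsFlags is a static dictionary, so this affects all later parses.
        HtmlNode.ElementsFlags.Remove("form");

        var document = new HtmlDocument();
        document.LoadHtml(html);

        var results = new List<string>();
        var forms = document.DocumentNode.SelectNodes("//form");
        if (forms == null)
        {
            // SelectNodes returns null when nothing matches.
            return results;
        }

        foreach (var form in forms)
        {
            results.Add(form.InnerHtml);
        }
        return results;
    }
}
```

The same ordering applies when loading from a URL: call HtmlNode.ElementsFlags.Remove("form") first, then getHtmlWeb.Load(...), and the loop in the question should then see the form's child markup in aTag.InnerHtml.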
