Scraping a webpage with C# and HTMLAgility

I have read that HTMLAgility 1.4 is a great solution for scraping a webpage. As a new programmer, I am hoping to get some input on this project. I am doing this as a C# forms application. The page I am working with is fairly straightforward: the information I need sits between just two tags, <table class="data"> and </table>.

My goal is to pull the data for Part-Num, Manu-Number, Description, Manu-Country, Last Modified, and Last Modified By out of the page and send it to a SQL table.

One twist is that there is also a small PNG picture that needs to be grabbed from the src="/partcode/number .

I do not have any completed code that works. I thought this bit of code would tell me whether I am heading in the right direction, but even when stepping through it in the debugger I can't see that it does anything. Could someone point me in the right direction on this? The more detailed the better, since it is apparent I have a lot to learn.

Thank you, I would really appreciate it.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
using System.Xml;

namespace Stats
{
    class PartParser
    {
        static void Main(string[] args)
        {
            try
            {
                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml("http://localhost");
                // My understanding is that this reads the entire page in?
                var tables = doc.DocumentNode.SelectNodes("//table");
                // I assume that this sets up the search for words containing table
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
                Console.WriteLine(ex.StackTrace);
                Console.ReadKey();
            }
        }
    }
}

The web code is:

<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
        <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
        <title>Part Number Database: Item Record</title>
        <table class="data">
            <tr><td>Part-Num</td><td width="50"></td><td>
            <img src="/partcode/number/072140" alt="072140"/></td></tr>
            <tr><td>Manu-Number</td><td width="50"></td><td>
            <img src="/partcode/manu/00721408" alt="00721408" /></td></tr>    
            <tr><td>Description</td><td></td><td>Widget 3.5</td></tr>
            <tr><td>Manu-Country</td><td></td><td>United States</td></tr>    
            <tr><td>Last Modified</td><td></td><td>26 Jan 2009,  8:08 PM</td></tr>    
            <tr><td>Last Modified By</td><td></td><td>Manu</td></tr>
        </table>
    <head/>
</html>

Check out this article on 4GuysFromRolla:

http://www.4guysfromrolla.com/articles/011211-1.aspx

This is the article I used as my starting point with HTML Agility Pack, and it has worked great. I'm confident that you'll get all the information you need from it to complete the tasks you're describing.

The beginning part is off:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("http://localhost");   

LoadHtml(html) loads an HTML string into the document; it does not fetch a URL. I think you want something like this instead:

HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc  = htmlWeb.Load("http://stackoverflow.com");
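
To make the difference concrete, here is a minimal sketch. It assumes the HtmlAgilityPack package is referenced; the class name LoadHtmlDemo, the helper CountTables, and the inline markup are made up for illustration. It shows that LoadHtml treats its argument as markup, so passing a URL simply produces a document with no tables in it, which would explain why nothing seemed to happen in the debugger.

```csharp
using System;
using HtmlAgilityPack;

public static class LoadHtmlDemo
{
    // Hypothetical helper: parses the given string as HTML markup
    // and returns how many <table> elements it contains.
    public static int CountTables(string markup)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(markup);   // parses text already in memory; no network access
        var tables = doc.DocumentNode.SelectNodes("//table");
        // SelectNodes returns null (not an empty list) when nothing matches.
        return tables == null ? 0 : tables.Count;
    }

    public static void Main()
    {
        // A URL is just plain text to LoadHtml, so no table is found.
        Console.WriteLine(CountTables("http://localhost"));

        // Actual markup is parsed and the table is found.
        Console.WriteLine(CountTables("<table class=\"data\"><tr><td>Part-Num</td></tr></table>"));
    }
}
```

Note that SelectNodes returning null is also why a foreach over its result can throw when the page was not loaded the way you expected.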

Here is working code based on the HTML source you provided. It could be refactored, and I'm not checking for null values (in rows, cells, or each value inside the case blocks). If the page is served at 127.0.0.1, it will work. Just paste it inside the Main method of a console application and try to understand it.

HtmlDocument doc = new HtmlWeb().Load("http://127.0.0.1");    

var rows = doc.DocumentNode.SelectNodes("//table[@class='data']/tr");
foreach (var row in rows)
{
    var cells = row.SelectNodes("./td");
    string title = cells[0].InnerText;
    var valueRow = cells[2];
    switch (title)
    {
        case "Part-Num":
            string partNum = valueRow.SelectSingleNode("./img[@alt]").Attributes["alt"].Value;
            Console.WriteLine("Part-Num:\t" + partNum);
            break;
        case "Manu-Number":
            string manuNumber = valueRow.SelectSingleNode("./img[@alt]").Attributes["alt"].Value;
            Console.WriteLine("Manu-Num:\t" + manuNumber);
            break;
        case "Description":
            string description = valueRow.InnerText;
            Console.WriteLine("Description:\t" + description);
            break;
        case "Manu-Country":
            string manuCountry = valueRow.InnerText;
            Console.WriteLine("Manu-Country:\t" + manuCountry);
            break;
        case "Last Modified":
            string lastModified = valueRow.InnerText;
            Console.WriteLine("Last Modified:\t" + lastModified);
            break;
        case "Last Modified By":
            string lastModifiedBy = valueRow.InnerText;
            Console.WriteLine("Last Modified By:\t" + lastModifiedBy);
            break;
    }
}
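
The loop above reads only the alt text of the images. For the PNG picture the question mentions, one possible approach is to read the src attribute as well and resolve it against the page address. The following is only a sketch under assumptions: the class and method names are hypothetical, the markup is an inline copy of two rows from the question, and http://127.0.0.1 stands in for wherever the page is really served. It also adds the null checks that the code above leaves out.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PartImageDemo
{
    // Hypothetical helper: returns label -> absolute image URL for every
    // row of the "data" table whose third cell contains an <img src="...">.
    public static Dictionary<string, string> ImageUrls(string markup, string pageUrl)
    {
        var result = new Dictionary<string, string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(markup);

        var rows = doc.DocumentNode.SelectNodes("//table[@class='data']/tr");
        if (rows == null) return result;                     // no matching table

        foreach (var row in rows)
        {
            var cells = row.SelectNodes("./td");
            if (cells == null || cells.Count < 3) continue;  // skip malformed rows

            var img = cells[2].SelectSingleNode("./img[@src]");
            if (img == null) continue;                       // text-only row, e.g. Description

            string src = img.GetAttributeValue("src", "");
            // Resolve the site-relative path against the page address.
            var absolute = new Uri(new Uri(pageUrl), src);
            result[cells[0].InnerText.Trim()] = absolute.AbsoluteUri;
        }
        return result;
    }

    public static void Main()
    {
        string markup =
            "<table class=\"data\">" +
            "<tr><td>Part-Num</td><td width=\"50\"></td><td>" +
            "<img src=\"/partcode/number/072140\" alt=\"072140\"/></td></tr>" +
            "<tr><td>Description</td><td></td><td>Widget 3.5</td></tr>" +
            "</table>";

        foreach (var pair in ImageUrls(markup, "http://127.0.0.1"))
            Console.WriteLine(pair.Key + "\t" + pair.Value);
    }
}
```

Against the live page you would get the document from new HtmlWeb().Load(...) instead of an inline string, and the resolved URL could then be handed to something like WebClient.DownloadFile to save the picture.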
