
Screen Scraping, Web Scraping, Web Harvesting, Web Data Extraction, etc. using C# and the .NET Framework

I am working on a Microsoft .NET application in C# for web harvesting, web scraping, web data extraction, screen scraping, or whatever you want to call it. For parsing HTML, I'm trying to incorporate HTML Agility Pack, but it's not as easy as I thought it would be. I have included some specifications and images of what I have so far and was hoping to get your opinions on how I could proceed. Basically, I want to do something similar to the layout used in Visual Web Ripper, but I have no idea how they do it. Any ideas?

Specifications:

My goal is to make a very user-friendly point-and-click application for downloading data and images from the web. I would like to load HTML pages in the web browser control and output the parsed data and image links into the text box. The user can specify which HTML tags they want and then download the data into the grid. Finally, they can export the data into whatever format they need.

I'm trying to use HTML Agility Pack to load the HTML on the webpage and display it in the textbox.

// Load Web Browser
private void Form6_Load(object sender, EventArgs e)
{
    // Navigate to webpage
    webBrowser.Navigate("http://www.webopedia.com/TERM/H/HTML.html");
    // Save URL to memory
    SiteMemoryArray[count] = urlTextBox.Text; 
    // Load HTML from webBrowser
    HtmlWindow window = webBrowser.Document.Window; 
    string str = window.Document.Body.OuterHtml;
    // Extract tags using HtmlAgilityPack and display in textbox
    HtmlAgilityPack.HtmlDocument HtmlDoc = new HtmlAgilityPack.HtmlDocument();
    HtmlDoc.LoadHtml(str);
    HtmlAgilityPack.HtmlNodeCollection Nodes =
        HtmlDoc.DocumentNode.SelectNodes("//a");
    foreach (HtmlAgilityPack.HtmlNode Node in Nodes)
    {
        textBox2.Text += Node.OuterHtml + "\r\n";
    }
}

Using:

HtmlWindow window = webBrowser.Document.Window;

I get the error: Object reference not set to an instance of an object.

The page has probably not finished loading when you reference the browser window. Navigate returns immediately, so webBrowser.Document can still be null inside Form6_Load. Handle the WebBrowser control's DocumentCompleted event, which fires when loading is done, and read the document there. See this SO answer for an example: C# how to wait for a webpage to finish loading before continuing
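As a minimal sketch of that idea, the form below moves the parsing from the Load handler into a DocumentCompleted handler. The control names webBrowser and textBox2 and the URL come from the question; the class name ScraperForm, the layout, and the null checks are assumptions added for illustration, and the HtmlAgilityPack package is assumed to be referenced.

```csharp
using System;
using System.Text;
using System.Windows.Forms;

public class ScraperForm : Form
{
    // Control names follow the question; the layout is only an assumption.
    private readonly WebBrowser webBrowser = new WebBrowser { Dock = DockStyle.Fill };
    private readonly TextBox textBox2 = new TextBox
    {
        Multiline = true,
        Dock = DockStyle.Bottom,
        Height = 200,
        ScrollBars = ScrollBars.Vertical
    };

    public ScraperForm()
    {
        Controls.Add(webBrowser);
        Controls.Add(textBox2);

        // Subscribe before navigating so the completion event is not missed.
        webBrowser.DocumentCompleted += WebBrowser_DocumentCompleted;
        Load += (sender, e) => webBrowser.Navigate("http://www.webopedia.com/TERM/H/HTML.html");
    }

    private void WebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The event also fires for frames; only react to the top-level page.
        if (e.Url != webBrowser.Url) return;
        if (webBrowser.Document == null || webBrowser.Document.Body == null) return;

        var htmlDoc = new HtmlAgilityPack.HtmlDocument();
        htmlDoc.LoadHtml(webBrowser.Document.Body.OuterHtml);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = htmlDoc.DocumentNode.SelectNodes("//a");
        if (nodes == null) return;

        var sb = new StringBuilder();
        foreach (var node in nodes)
        {
            sb.AppendLine(node.OuterHtml);
        }
        textBox2.Text = sb.ToString();
    }

    [STAThread]
    private static void Main()
    {
        Application.Run(new ScraperForm());
    }
}
```

Building the text in a StringBuilder and assigning it once avoids repainting the text box for every link, which matters on pages with many anchors.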

I am not familiar with HtmlAgilityPack, but one component I have used in the past is SgmlReader: http://developer.mindtouch.com/SgmlReader . It works like a drop-in replacement for an XmlReader and will even convert the document to XML for you if you want. You can load the result into an XmlDocument (or even an XDocument), and then it's up to you what you do with it.

So I'd suggest using an HttpWebRequest to get the HTML and then loading the HTML into this component. That way you don't need to go anywhere near a WebBrowser control.
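A rough sketch of the download step with HttpWebRequest might look like the following. To stay with the library already used in the question, the downloaded HTML is parsed with HtmlAgilityPack here rather than SgmlReader. The class name PageFetcher, the method name DownloadHtml, and the user-agent string are made up for illustration.

```csharp
using System;
using System.IO;
using System.Net;

public static class PageFetcher
{
    // Downloads the raw HTML of a page without using a WebBrowser control.
    public static string DownloadHtml(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        // Hypothetical user agent; some sites reject requests that send none.
        request.UserAgent = "Mozilla/5.0 (compatible; ExampleScraper/1.0)";

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream)) // assumes UTF-8 unless a BOM says otherwise
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        string html = DownloadHtml("http://www.webopedia.com/TERM/H/HTML.html");

        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when no node matches the XPath expression.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return;

        foreach (var anchor in anchors)
        {
            Console.WriteLine(anchor.GetAttributeValue("href", string.Empty));
        }
    }
}
```

One trade-off of this approach: the request returns only the HTML the server sends, so content that the page builds with JavaScript after loading will not be present.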

For screen scraping, if you are searching for particular images or shapes on screen, you can use:

Emgu CV (a .NET wrapper for OpenCV)

You can also read the screen using the WinAPI, like this:

private Bitmap Capture(IntPtr hwnd)
{
    return Capture(hwnd, GetClientRectangle());
}

private Bitmap Capture(IntPtr hwnd, Rectangle zone)
{
    IntPtr hdcSrc = GetWindowDC(hwnd);
    IntPtr hdcDest = CreateCompatibleDC(hdcSrc);
    IntPtr hBitmap = CreateCompatibleBitmap(hdcSrc, zone.Width, zone.Height);
    IntPtr hOld = SelectObject(hdcDest, hBitmap);
    BitBlt(hdcDest, 0, 0, zone.Width, zone.Height, hdcSrc, zone.X, zone.Y, SRCCOPY);
    SelectObject(hdcDest, hOld);
    DeleteDC(hdcDest);
    ReleaseDC(hwnd, hdcSrc);
    Bitmap retBitmap = Bitmap.FromHbitmap(hBitmap);
    DeleteObject(hBitmap);
    return retBitmap;
}
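The Capture methods above call several Win32 functions whose declarations are not shown. A sketch of the P/Invoke declarations they would need is given below. The class name NativeMethods and the GetClientRectangle helper are assumptions: the original helper is not shown, and this version takes the window handle as a parameter, unlike the parameterless call above.

```csharp
using System;
using System.Drawing;
using System.Runtime.InteropServices;

internal static class NativeMethods
{
    // Raster-operation code for a plain source-to-destination copy.
    public const int SRCCOPY = 0x00CC0020;

    [StructLayout(LayoutKind.Sequential)]
    public struct RECT { public int Left, Top, Right, Bottom; }

    [DllImport("user32.dll")]
    public static extern IntPtr GetWindowDC(IntPtr hWnd);

    [DllImport("user32.dll")]
    public static extern int ReleaseDC(IntPtr hWnd, IntPtr hDC);

    [DllImport("user32.dll")]
    [return: MarshalAs(UnmanagedType.Bool)]
    public static extern bool GetClientRect(IntPtr hWnd, out RECT lpRect);

    [DllImport("gdi32.dll")]
    public static extern IntPtr CreateCompatibleDC(IntPtr hdc);

    [DllImport("gdi32.dll")]
    public static extern IntPtr CreateCompatibleBitmap(IntPtr hdc, int width, int height);

    [DllImport("gdi32.dll")]
    public static extern IntPtr SelectObject(IntPtr hdc, IntPtr hgdiobj);

    [DllImport("gdi32.dll")]
    [return: MarshalAs(UnmanagedType.Bool)]
    public static extern bool BitBlt(IntPtr hdcDest, int xDest, int yDest, int width, int height,
                                     IntPtr hdcSrc, int xSrc, int ySrc, int rop);

    [DllImport("gdi32.dll")]
    [return: MarshalAs(UnmanagedType.Bool)]
    public static extern bool DeleteDC(IntPtr hdc);

    [DllImport("gdi32.dll")]
    [return: MarshalAs(UnmanagedType.Bool)]
    public static extern bool DeleteObject(IntPtr hObject);

    // Hypothetical helper: returns the client area of a window as a Rectangle,
    // or Rectangle.Empty if the handle is not valid.
    public static Rectangle GetClientRectangle(IntPtr hwnd)
    {
        RECT r;
        if (!GetClientRect(hwnd, out r)) return Rectangle.Empty;
        return new Rectangle(r.Left, r.Top, r.Right - r.Left, r.Bottom - r.Top);
    }
}
```

To use these from the Capture methods, either move the declarations into the same class or qualify each call, for example NativeMethods.BitBlt(...) and NativeMethods.SRCCOPY.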

To parse an HTML document:

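using System.Threading;
using SHDocVw; // Interop.SHDocVw.dll
using mshtml;  // Microsoft.mshtml.dll

InternetExplorer ie = new InternetExplorer();
ie.Visible = true;
ie.Navigate("http://www.example.com");
// Poll until the page has loaded instead of sleeping for a fixed five seconds.
while (ie.Busy || ie.ReadyState != tagREADYSTATE.READYSTATE_COMPLETE)
{
    Thread.Sleep(100);
}
// ie.Document is typed as object, so cast it to the mshtml document type.
mshtml.HTMLDocument doc = (mshtml.HTMLDocument)ie.Document;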

To get all elements of a tag:

// Pass the tag name of the elements you want, here "a" for anchors:
IHTMLElementCollection AnchorColl = doc.getElementsByTagName("a");

Then iterate over AnchorColl to process every element with that tag.
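Iterating over the collection could look roughly like this. The class name AnchorLister and the method name PrintAnchors are made up for illustration, and the document is assumed to be fully loaded before it is passed in.

```csharp
using System;
using mshtml; // Microsoft.mshtml.dll

public static class AnchorLister
{
    // Prints the link text and href of every <a> element in a loaded document.
    public static void PrintAnchors(HTMLDocument doc)
    {
        IHTMLElementCollection anchorColl = doc.getElementsByTagName("a");

        foreach (IHTMLElement element in anchorColl)
        {
            // getAttribute returns object; it can be null when the attribute is absent.
            object href = element.getAttribute("href", 0);
            Console.WriteLine("{0} -> {1}", element.innerText, href);
        }
    }
}
```

With the doc variable from the earlier snippet, the call would be AnchorLister.PrintAnchors(doc).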

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For questions, contact yoyou2525@163.com.

 