Grab the contents of a Drupal website that is secured with a login form

I would like to grab some content from a website that is built with Drupal. The challenge is that I need to log in to the site before I can access the page I want to scrape. Is there a way to automate this login process in my C# code so I can grab the secured content?

You'll have to use the Services module to do that. Also check out this link for a bit of explanation.

To access the secured content, you need to store cookies and send them with every request to the server. Start with the request that sends your login info, then save the session cookie that the server gives you (which is your proof that you are who you say you are).

You can use System.Windows.Forms.WebBrowser for an out-of-the-box solution that handles cookies for you, at the cost of less control.

My preferred method is to use System.Net.HttpWebRequest to send and receive all web data and then use the HtmlAgilityPack to parse the returned data into a Document Object Model (DOM) which can be easily read from.
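As a rough illustration of the parsing half of that approach, here is a minimal sketch using HtmlAgilityPack (a third-party NuGet package). The HTML string, the `title` class, and the XPath expressions are made up for the example; a real Drupal page will need selectors that match its own markup.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

public static class ParseExample
{
    public static void Main()
    {
        // Hypothetical markup standing in for a page you have already downloaded.
        string html = "<html><body><h1 class=\"title\">Hello</h1>" +
                      "<a href=\"/node/1\">First</a></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode and SelectNodes return null when nothing matches.
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//h1[@class='title']");
        if (title != null)
            Console.WriteLine(title.InnerText);

        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
                Console.WriteLine(link.GetAttributeValue("href", "") + " -> " + link.InnerText);
        }
    }
}
```

In practice the string passed to `LoadHtml` would be the body you read from the HttpWebResponse stream.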

The trick to getting System.Net.HttpWebRequest to work is that you must create a long-lived System.Net.CookieContainer that keeps track of your login info (and anything else the server expects you to keep track of). The good news is that HttpWebRequest will take care of all of this for you if you provide the container.
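To see what the container does without touching the network, the following sketch adds a cookie by hand and asks which Cookie header would be sent to two different URLs. The cookie name and value are invented; in real use HttpWebRequest fills the container from the server's Set-Cookie headers.

```csharp
using System;
using System.Net;

public static class CookieJarDemo
{
    public static void Main()
    {
        var cookieJar = new CookieContainer();

        // Simulate a session cookie the server would normally set at login.
        // The name and value here are made up for illustration.
        cookieJar.Add(new Cookie("SESS_example", "abc123", "/", "www.mysite.com"));

        // Header that would accompany a request to the same host.
        Console.WriteLine(cookieJar.GetCookieHeader(new Uri("http://www.mysite.com/secured_page.htm")));

        // A different host does not receive the cookie, so this header is empty.
        Console.WriteLine(cookieJar.GetCookieHeader(new Uri("http://www.example.org/")));
    }
}
```

This is why the same container object has to be reused: a fresh CookieContainer starts empty, so the server would see the next request as anonymous.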

You need a new HttpWebRequest for each call you make, so you must set each request's .CookieContainer to the same object every time. Here is an example:

UNTESTED

using System.Net;

public class ScraperExample
{
    public void TestConnect()
    {
        // One long-lived container shared by every request in the session.
        CookieContainer cookieJar = new CookieContainer();

        // Load the login page so the server can set any initial cookies.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.mysite.com/login.htm");
        request.CookieContainer = cookieJar;
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        response.Close();

        // do page parsing and request setting here
        request = (HttpWebRequest)WebRequest.Create("http://www.mysite.com/submit_login.htm");
        // add specific page parameters here
        request.CookieContainer = cookieJar;
        response = (HttpWebResponse)request.GetResponse();
        response.Close();

        request = (HttpWebRequest)WebRequest.Create("http://www.mysite.com/secured_page.htm");
        request.CookieContainer = cookieJar;
        // this will now work since you have saved your authentication cookies in 'cookieJar'
        response = (HttpWebResponse)request.GetResponse();
        // read response.GetResponseStream() here, then release the connection
        response.Close();
    }
}
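The "add specific page parameters here" step is usually a form POST. Below is a hedged sketch of what that might look like for a Drupal site. The `/user/login` path and the `name`, `pass`, `form_id`, and `op` fields are assumptions based on a typical Drupal login form, and `FetchSecuredPage` is a hypothetical helper; inspect the real form's HTML (including any hidden fields such as a form build ID) and adjust accordingly. Like the example above, this is untested against a live site.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class DrupalLoginSketch
{
    // Hypothetical helper: logs in, then returns the HTML of a secured page.
    public static string FetchSecuredPage(string baseUrl, string user, string password, string securedPath)
    {
        var cookieJar = new CookieContainer();

        // Assumed field names; confirm them against the site's actual login form.
        string formData =
            "name=" + Uri.EscapeDataString(user) +
            "&pass=" + Uri.EscapeDataString(password) +
            "&form_id=user_login" +
            "&op=" + Uri.EscapeDataString("Log in");
        byte[] body = Encoding.UTF8.GetBytes(formData);

        // Assumed login path.
        var login = (HttpWebRequest)WebRequest.Create(baseUrl + "/user/login");
        login.Method = "POST";
        login.ContentType = "application/x-www-form-urlencoded";
        login.ContentLength = body.Length;
        login.CookieContainer = cookieJar;
        using (Stream stream = login.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }
        using (var response = (HttpWebResponse)login.GetResponse())
        {
            // If the login was accepted, cookieJar now holds the session cookie.
        }

        // Reuse the same container so the session cookie is sent along.
        var page = (HttpWebRequest)WebRequest.Create(baseUrl + securedPath);
        page.CookieContainer = cookieJar;
        using (var response = (HttpWebResponse)page.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A call might look like `DrupalLoginSketch.FetchSecuredPage("http://www.mysite.com", "myuser", "mypassword", "/secured_page.htm")`. A failed login typically still returns a normal page rather than an error, so it is worth checking the returned HTML for the content you expect.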

WebBrowser Class: http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.aspx

HttpWebRequest Class

HttpWebRequest.CookieContainer Property: http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.cookiecontainer.aspx
