Grab the contents of a Drupal website that is secured with a login form

I would like to grab some content from a website built with Drupal. The challenge is that I need to log in to the site before I can access the page I want to scrape. Is there a way to automate this login process in my C# code so that I can grab the secured content?

You'll have to use the Services module to do that. Also check out this link for a bit of explanation.

To access the secured content, you'll need to store cookies and send them with every request to the server. Start with the request that sends your login info, then save the session cookie that the server gives you (which is your proof that you are who you say you are).

You can use System.Windows.Forms.WebBrowser for an out-of-the-box solution that handles cookies for you, at the cost of less control.

My preferred method is to use System.Net.HttpWebRequest to send and receive all web data, and then use the HtmlAgilityPack to parse the returned HTML into a Document Object Model (DOM) that is easy to read from.
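As an illustration of the parsing step, here is a small sketch that collects the hidden input fields from a login page, since login forms often carry hidden values that have to be posted back with the credentials. It assumes the HtmlAgilityPack NuGet package is referenced; the class name LoginFormParser and the method name HiddenFields are made up for this example, and which hidden fields a given site actually uses has to be checked against its real login page.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of the framework

public static class LoginFormParser
{
    // Returns the name/value pairs of every <input type="hidden"> in the HTML.
    public static Dictionary<string, string> HiddenFields(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var fields = new Dictionary<string, string>();

        // SelectNodes returns null (not an empty list) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//input[@type='hidden']");
        if (nodes == null)
        {
            return fields;
        }

        foreach (HtmlNode node in nodes)
        {
            string name = node.GetAttributeValue("name", "");
            if (name.Length > 0)
            {
                fields[name] = node.GetAttributeValue("value", "");
            }
        }
        return fields;
    }
}
```

The returned dictionary can then be merged with the user name and password fields before the login form is submitted.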

The trick to getting System.Net.HttpWebRequest to work is that you must create a long-lived System.Net.CookieContainer that will keep track of your login info (and anything else the server expects you to keep track of). The good news is that HttpWebRequest will take care of all of this for you if you provide the container.
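To see what the container does on its own, here is a minimal sketch that needs no network access. It puts a cookie into a CookieContainer by hand (standing in for the Set-Cookie header a real login response would send) and then asks which Cookie header would go out with a later request. The class name CookieJarDemo and its method names are made up for illustration, as are any cookie names and URLs you pass in.

```csharp
using System;
using System.Net;

public static class CookieJarDemo
{
    // Builds a jar that already holds one session cookie for the given site.
    // With HttpWebRequest this happens automatically when a response sets a cookie.
    public static CookieContainer JarWithSession(string siteUrl, string name, string value)
    {
        var jar = new CookieContainer();
        // Path "/" makes the cookie apply to every page on the site.
        jar.Add(new Uri(siteUrl), new Cookie(name, value, "/"));
        return jar;
    }

    // Returns the Cookie header the jar would attach to a request for 'url'
    // (an empty string when no stored cookie matches that URL).
    public static string HeaderFor(CookieContainer jar, string url)
    {
        return jar.GetCookieHeader(new Uri(url));
    }
}
```

The point of the sketch is that the jar, not the request, owns the session: any request that is handed the same jar gets the matching cookies, and a request to a different host gets none.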

You need a new HttpWebRequest for each call you make, so you must set each one's .CookieContainer to the same object every time. Here is an example:

UNTESTED

using System.Net;

public void TestConnect()
{
    // One long-lived cookie jar shared by every request in the session
    CookieContainer cookieJar = new CookieContainer();

    // Load the login page first so the server can set its initial cookies
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.mysite.com/login.htm");
    request.CookieContainer = cookieJar;
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();

    // do page parsing and request setting here, then close the response
    // so its connection is released before the next request
    response.Close();

    request = (HttpWebRequest)WebRequest.Create("http://www.mysite.com/submit_login.htm");
    // add specific page parameters here
    request.CookieContainer = cookieJar;
    response = (HttpWebResponse)request.GetResponse();
    response.Close();

    request = (HttpWebRequest)WebRequest.Create("http://www.mysite.com/secured_page.htm");
    request.CookieContainer = cookieJar;
    // this will now work since you have saved your authentication cookies in 'cookieJar'
    response = (HttpWebResponse)request.GetResponse();
    response.Close();
}
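The "add specific page parameters here" step usually means sending the login form as a POST. Below is a rough, untested sketch of that step, assuming the site accepts a standard application/x-www-form-urlencoded form post. FormLogin, BuildFormBody and PostForm are hypothetical helper names, and the field names you pass in are placeholders: read the real ones (including any hidden fields) from the site's own login form.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

public static class FormLogin
{
    // Encodes name/value pairs as an application/x-www-form-urlencoded body.
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var parts = new List<string>();
        foreach (KeyValuePair<string, string> field in fields)
        {
            parts.Add(Uri.EscapeDataString(field.Key) + "=" + Uri.EscapeDataString(field.Value));
        }
        return string.Join("&", parts);
    }

    // Posts the body to 'url' using the shared cookie jar and returns the response text.
    // Any cookies the server sets in its response end up in 'cookieJar'.
    public static string PostForm(string url, string body, CookieContainer cookieJar)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookieJar;

        byte[] payload = Encoding.UTF8.GetBytes(body);
        request.ContentLength = payload.Length;
        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(payload, 0, payload.Length);
        }

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With the shared jar, a call might look like FormLogin.PostForm("http://www.mysite.com/submit_login.htm", FormLogin.BuildFormBody(fields), cookieJar), where fields holds whatever input names the login form actually uses. A successful login leaves the session cookie in cookieJar, ready for the request to the secured page.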

http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.aspx

HttpWebRequest Class

http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.cookiecontainer.aspx
