
Scrape a GWT Website using HTTPWebRequest in C#

I want to screen-scrape a website built with Google Web Toolkit (GWT). The page I am trying to scrape appears to be a Flash page.

I use the following code.

HttpWebRequest request   = (HttpWebRequest)HttpWebRequest.Create("https://xxx");
request.Method            = "POST";
request.UserAgent         = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36";
request.Headers["Cookie"] = SessionID;
request.Referer           = "xxx";
request.Accept            = "*/*";

request.Headers["X-GWT-Permutation"] = "E81756AE355F23274CB68B43D62F0248";
request.Headers["X-GWT-Module-Base"] = "https://xxx";

byte[] buffer             = System.Text.Encoding.ASCII.GetBytes(encodeData("7|0|6|https://xxx"));
Stream PostData           = request.GetRequestStream();

PostData.Write(buffer, 0, buffer.Length);
PostData.Close();

HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream stream            = response.GetResponseStream();

The response I get back is the following error:

/EX[2,1,["com.google.gwt.user.client.rpc.IncompatibleRemoteServiceException/3936916533","Parameter 0 of is of an unknown type 'java.lang.String%2F2004016611'"],0,7]

Looking forward to your reply.
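
A note on the error above: `%2F` is the URL-encoded form of `/`, and GWT type signatures in an RPC payload are normally written like `java.lang.String/2004016611`. That suggests `encodeData` URL-encodes the payload before it is sent. Its body is not shown, so this is an assumption. A minimal sketch that posts the payload unencoded, with the `text/x-gwt-rpc` content type that GWT-RPC services expect, might look like the following. The class and method names are hypothetical, and the URL, cookie and payload arguments are placeholders for the values used in the question.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class GwtRpcSketch
{
    // Shows what URL-encoding does to a GWT type signature:
    // the '/' separator becomes "%2F", matching the text in the error.
    public static string UrlEncoded(string typeSignature)
    {
        return Uri.EscapeDataString(typeSignature);
    }

    // Converts the GWT-RPC payload to bytes as-is, with no URL-encoding.
    public static byte[] BuildBody(string rawPayload)
    {
        return Encoding.UTF8.GetBytes(rawPayload);
    }

    // Posts the raw payload and returns the response text.
    // All three arguments are placeholders supplied by the caller.
    public static string Post(string url, string rawPayload, string cookie)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "text/x-gwt-rpc; charset=utf-8";
        request.Headers["Cookie"] = cookie;
        // The X-GWT-Permutation and X-GWT-Module-Base headers from the
        // question would be set here as well.

        byte[] body = BuildBody(rawPayload);
        using (Stream post = request.GetRequestStream())
        {
            post.Write(body, 0, body.Length);
        }

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

If the server still rejects the call after this change, the payload itself (string table, strong names) may not match what the deployed service expects, which this sketch does not address.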

Using HttpWebRequest is very efficient, but in my experience problems like the one you are seeing often arise when the scripts on the page have no actual web browser to run in.

I get around this by placing an actual WebBrowser control on a Windows Form and navigating that control to the desired URL. Once the control (which is a real web browser) has finished rendering the page, you can access anything on it. For instance, if text has been scrambled inside a hidden DIV element (a common tactic to thwart scraping), you cannot get at it from the page source, but you can still get at it in the WebBrowser control, because a client script had to unscramble it before it could be displayed. With the WebBrowser control, "if you can see it, you can get it".
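
As a rough illustration of reading such an element from the rendered page rather than from the raw source, here is a small helper. It is only a sketch: the class name, method name and the element id passed in are hypothetical, and it assumes the page has already finished rendering in the control.

```csharp
using System.Windows.Forms;

static class RenderedDomSketch
{
    // Returns the text of the element with the given id as it currently
    // exists in the control's DOM, or null if it cannot be found.
    public static string GetRenderedText(WebBrowser browser, string elementId)
    {
        if (browser == null || browser.Document == null)
        {
            return null;
        }

        HtmlElement element = browser.Document.GetElementById(elementId);
        return element == null ? null : element.InnerText;
    }
}
```

It could be called from the form as `RenderedDomSketch.GetRenderedText(webBrowser1, "someId")`, where `"someId"` stands in for whatever id the target page actually uses.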

EDIT: Simplest implementation. Just try it and see if the missing source code is now present in the document source.

Create a Windows Form. Add a button called button1. Add a WebBrowser control called webBrowser1, undock it and size it. Add a TextBox called textBox1 and make it multi-line. Then add this code behind the form:

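  // Start loading the page when the button is clicked.
  private void button1_Click(object sender, EventArgs e) {
     string url = "http://google.ca";
     webBrowser1.Navigate(url);
  }

  // Runs when the control has finished loading a document.
  // Attach this handler to webBrowser1's DocumentCompleted event.
  private void webBrowser1_DocumentCompleted(object sender,
                               WebBrowserDocumentCompletedEventArgs e) {
     // Skip if there is no document body to read yet.
     if (webBrowser1.Document == null || webBrowser1.Document.Body == null) return;
     string pageSource = webBrowser1.Document.Body.OuterHtml;
     textBox1.Text = pageSource;
  }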

Replace the above URL with your URL. Run the form and click the button. See if the HTML you wanted is now present in the resulting text box. If it is, most of your problem is now solved. I have had great luck with this technique. It should solve any problems with content that is rendered late and therefore not present in an unobfuscated form in the actual source code.
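
One caveat with late-rendered content: DocumentCompleted can fire once per frame, and script-driven pages may still be filling in the DOM after it fires. A possible way to handle that is to ignore frame completions and then poll for a known element on a timer. The sketch below assumes this approach; the class name, the element id, the 500 ms interval and the limit of 20 attempts are all illustrative choices, not values taken from the target site.

```csharp
using System;
using System.Windows.Forms;

public class LateContentWatcher
{
    private readonly WebBrowser browser;
    private readonly string elementId;   // hypothetical id of the late-rendered element
    private readonly Action<string> onReady;
    private readonly System.Windows.Forms.Timer timer = new System.Windows.Forms.Timer();
    private int attemptsLeft;

    public LateContentWatcher(WebBrowser browser, string elementId, Action<string> onReady)
    {
        this.browser = browser;
        this.elementId = elementId;
        this.onReady = onReady;
        timer.Interval = 500;             // poll twice per second
        timer.Tick += OnTick;
        browser.DocumentCompleted += OnDocumentCompleted;
    }

    // True only when the completed URL is the page itself, not one of its frames.
    public static bool IsMainDocument(Uri completedUrl, Uri browserUrl)
    {
        return completedUrl != null && completedUrl.Equals(browserUrl);
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        if (!IsMainDocument(e.Url, browser.Url)) return;
        attemptsLeft = 20;                // give up after about ten seconds
        timer.Start();
    }

    private void OnTick(object sender, EventArgs e)
    {
        HtmlElement element = browser.Document == null
            ? null
            : browser.Document.GetElementById(elementId);

        if (element != null)
        {
            timer.Stop();
            onReady(element.OuterHtml);   // hand the rendered markup to the caller
        }
        else if (--attemptsLeft <= 0)
        {
            timer.Stop();                 // element never appeared; stop polling
        }
    }
}
```

From the form's constructor this could be wired up as `new LateContentWatcher(webBrowser1, "content", html => textBox1.Text = html);`, with `"content"` replaced by an id that actually exists on the page.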

Technical posts on this site are licensed under CC BY-SA 4.0. If you reprint, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.
