简体   繁体   中英

Web/Screen Scraping with Google App Engine - Code works in python interpreter but not GAE

I want to do some web scraping with GAE. (Infinite Campus Student Information Portal, fyi). This service requires you to login to get in the website. I had some code that worked using mechanize in normal python. When I learned that I couldn't use mechanize in Google App Engine I ended up using urllib2 + ClientForm. I couldn't get it to login to the server, so after a few hours of fiddling with cookie handling I ran the exact same code in a normal python interpreter, and it worked. I found the log file and saw a ton of messages about stripping out the 'host' header in my request... I found the source file on Google Code and the host header was in an 'untrusted' list and removed from all requests by user code.

Apparently GAE strips out the host header, which is required by IC to determine which school system to log you in, which is why it appeared like I couldn't login.

How would I get around this problem? I can't specify anything else in my fake form submission to the target site. Why would this be a "security hole" in the first place?

App Engine does not strip out the Host header: it forces it to be an accurate value based on the URI you are requesting. Assuming that URI's absolute, the server isn't even allowed to consider the Host header anyway, per RFC2616 :

  1. If Request-URI is an absoluteURI, the host is part of the Request-URI. Any Host header field value in the request MUST be ignored.

...so I suspect you're misdiagnosing the cause of your problem. Try directing the request to a "dummy" server that you control (eg another very simple app engine app of yours) so you can look at all the headers and body of the request as it comes from your GAE app, vs, how it comes from your "normal python interpreter". What do you observe this way?

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM