How come I can't download this webpage in Python?
Try it yourself :)
curl http://www.windowsphone.com/en-US/apps?list=free
The result is:
<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="https://login.live.com/login.srf?wa=wsignin1.0&rpsnv=11&checkda=1&ct=1320735308&rver=6.1.6195.0&wp=MBI&wreply=http:%2F%2Fwww.windowsphone.com%2Fen-US%2Fapps%3Flist%3Dfree&lc=1033&id=268289">here</a>.</h2>
</body></html>
Or:
import random
import socket
import urllib2  # Python 2 module


def download(source_url):
    try:
        socket.setdefaulttimeout(10)
        # Pick one of several browser User-Agent strings at random.
        agents = ['Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)','Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)','Microsoft Internet Explorer/4.0b1 (Windows 95)','Opera/8.00 (Windows NT 5.1; U; en)']
        ree = urllib2.Request(source_url)
        ree.add_header('User-Agent', random.choice(agents))
        resp = urllib2.urlopen(ree)
        htmlSource = resp.read()
        return htmlSource
    except Exception as e:
        print(e)
        return ""

download('http://www.windowsphone.com/en-US/apps?list=free')
The result is:
<html><head><meta http-equiv="REFRESH" content="0; URL=http://www.windowsphone.com/en-US/apps?list=free"><script type="text/javascript">function OnBack(){}</script></head></html>
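I want to download the actual source of the web page.
It fails because http://www.windowsphone.com tries to set a cookie, which is then checked at https://login.live.com. That check creates another cookie and, if it succeeds, redirects back to windowsphone.com. A client that does not store and resend cookies never gets past this check.
You should take a look at http://docs.python.org/library/cookielib.html
If you want to use curl, allow it to create a cookie file, like this: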
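As a rough illustration of the cookielib suggestion, here is a minimal sketch of a cookie-aware download. It assumes Python 3, where cookielib is named http.cookiejar and urllib2 is urllib.request; the helper names make_cookie_opener and download_with_cookies are made up for this example, and it does not verify how the site behaves today.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return (opener, jar): an opener that stores cookies and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # One of the User-Agent strings from the question; the value is arbitrary.
    opener.addheaders = [
        ("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)")
    ]
    return opener, jar


def download_with_cookies(source_url, timeout=10):
    """Fetch source_url, keeping cookies across the redirects along the way."""
    opener, jar = make_cookie_opener()
    with opener.open(source_url, timeout=timeout) as resp:
        return resp.read()
```

The opener follows redirects as urlopen does, but because every response passes through the HTTPCookieProcessor, a cookie set on one hop is sent again on the next.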
curl -so /dev/null 'http://www.windowsphone.com/en-US/apps?list=free' -c 'myCookieJar'
Run more myCookieJar in your shell and you will see something like this:
# Netscape HTTP Cookie File
# http://www.netscape.com/newsref/std/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
.www.windowsphone.com TRUE / FALSE 0 WPMSLSS SLSS=1
login.live.com FALSE / FALSE 0 MSPRequ lt=1320738008&co=1&id=268289
Then run this (note the -b option before 'myCookieJar'):
curl -so 'windowsphone.html' 'http://www.windowsphone.com/en-US/apps?list=free' -b 'myCookieJar'
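You will then find the content of the page, as you would see it in a browser, in the file windowsphone.html.
Flesk indeed has a good answer to this (+1).
Another simple way to debug HTTP connections is Netcat, which is basically a powerful telnet utility.
So, suppose you want to debug what is happening in your HTTP request: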
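The cookie file that curl writes uses the Netscape format, which Python can also read. The sketch below assumes Python 3 (http.cookiejar.MozillaCookieJar). The helper names and the temporary sample file are made up for the example, and the sample lines are copied from the file shown above with tabs between the fields, as in the real file. curl stores session cookies with an expiry of 0, which some Python versions treat as a timestamp in the past, so the sketch loads them with ignore_expires=True and marks them as session cookies.

```python
import http.cookiejar
import os
import tempfile
import urllib.request

# Sample content copied from the cookie file shown above (tab-separated).
SAMPLE_JAR = (
    "# Netscape HTTP Cookie File\n"
    ".www.windowsphone.com\tTRUE\t/\tFALSE\t0\tWPMSLSS\tSLSS=1\n"
    "login.live.com\tFALSE\t/\tFALSE\t0\tMSPRequ\tlt=1320738008&co=1&id=268289\n"
)


def write_sample_jar():
    """Write SAMPLE_JAR to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as handle:
        handle.write(SAMPLE_JAR)
    return path


def load_curl_cookie_jar(path):
    """Load a curl/Netscape cookie file, keeping session cookies."""
    jar = http.cookiejar.MozillaCookieJar(path)
    jar.load(ignore_discard=True, ignore_expires=True)
    for cookie in jar:
        if cookie.expires == 0:
            # Expiry 0 is how curl marks a session cookie.
            cookie.expires = None
            cookie.discard = True
    return jar


def opener_for(jar):
    """Build an opener that sends the cookies in jar with each request."""
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
```

With a real file, load_curl_cookie_jar('myCookieJar') followed by opener_for(jar).open(url) would play the role of curl's -b option.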
$ nc www.windowsphone.com 80
GET /en-US/apps?list=free HTTP/1.0
Host: www.windowsphone.com
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
This sends the request headers to the server (you need to press Enter twice to send them).
After that, the server responds with:
HTTP/1.1 302 Found
Location: https://login.live.com/login.srf?wa=wsignin1.0&rpsnv=11&checkda=1&ct=1320745265&rver=6.1.6195.0&wp=MBI&wreply=http:%2F%2Fwww.windowsphone.com%2Fen-US%2Fapps%3Flist%3Dfree&lc=1033&id=268289
Server: Microsoft-IIS/7.5
Set-Cookie: WPMSLSS=SLSS=1; domain=www.windowsphone.com; path=/; HttpOnly
X-Powered-By: ASP.NET
X-Server: SN2CONXWWBA06
Date: Tue, 08 Nov 2011 09:41:05 GMT
Connection: close
Content-Length: 337
<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="https://login.live.com/login.srf?wa=wsignin1.0&rpsnv=11&checkda=1&ct=1320745265&rver=6.1.6195.0&wp=MBI&wreply=http:%2F%2Fwww.windowsphone.com%2Fen-US%2Fapps%3Flist%3Dfree&lc=1033&id=268289">here</a>.</h2>
</body></html>
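So the server returned 302, the HTTP status code for a redirect, which prompts the "browser" to open the URL passed in the Location header.
Netcat is a great tool for debugging and tracing all kinds of network communication, and it helped me a lot when I wanted to dig deeper into the HTTP protocol.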
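The same inspection can be done from Python. The sketch below assumes Python 3 and uses http.client, which sends a single request and does not follow redirects, so the 302 and its Location header stay visible. The function names fetch_without_redirect and describe are made up for this example.

```python
import http.client
from urllib.parse import urlsplit


def fetch_without_redirect(url, timeout=10):
    """Send one GET and return (status, reason, headers), following nothing."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # Same User-Agent as in the Netcat session above.
        conn.request("GET", path, headers={
            "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)",
        })
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection closes cleanly
        return resp.status, resp.reason, resp.getheaders()
    finally:
        conn.close()


def describe(status, headers):
    """For a 3xx response, report where the server wants the client to go."""
    lowered = dict((name.lower(), value) for name, value in headers)
    location = lowered.get("location")
    if 300 <= status < 400 and location:
        return "redirect to " + location
    return "status %d" % status
```

Calling describe(status, headers) on the result of fetch_without_redirect(url) gives a one-line summary, and any Set-Cookie headers can be read from the same headers list.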