Python - getting source code with socket
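I want to send an HTTP GET request and receive the source code of a web page, and this has to be done with sockets. I set the buffer size to 4096, but my script only downloads a small part of the page.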
import socket
sock = socket.socket ( socket.AF_INET, socket.SOCK_STREAM )
sock.connect ( ( "edition.cnn.com", 80 ) )
host = socket.gethostbyname("edition.cnn.com")
sock.sendall('GET http://edition.cnn.com/index.html HTTP/1.1\r\n'\
+ 'User-Agent: agent123\r\n'\
+ 'Host: '+host+'\r\n'\
+ '\r\n')
print sock.recv(4096)
sock.close()
After running this code, the data I get is:
HTTP/1.1 200 OK
Server: nginx
Date: Wed, 01 Jan 2014 18:31:25 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: CG=GR:44:Réthimnon; path=/
Last-Modified: Wed, 01 Jan 2014 18:31:22 GMT
Vary: Accept-Encoding
Cache-Control: max-age=60, private
Expires: Wed, 01 Jan 2014 18:32:25 GMT
ac2a
<!DOCTYPE HTML>
<html lang="en-US">
<head>
<title>CNN.com International - Breaking, World, Business, Sports, Entertainment and Video News</title>
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<meta http-equiv="content-type" content="text/html;charset=utf-8"/>
<meta http-equiv="last-modified" content="2014-01-01T18:28:34Z"/>
<meta http-equiv="refresh" content="1800;url=http://edition.cnn.com/?refresh=1"/>
<meta name="robots" content="index,follow"/>
<meta name="googlebot" content="noarchive"/>
<meta name="description" content="CNN.com International delivers breaking news from across the globe and information on the latest top stories, business, sports and entertainment headlines. Follow the news as it happens through: special reports, videos, audio, photo galleries plus interactive maps and timelines."/>
<meta name="keywords" content="CNN, CNN news, CNN International, CNN International news, CNN Edition, Edition news, news, news online, breaking news, U.S. news, world news, global news, weather, business, CNN Money, sports, politics, law, technology, entertainment, education,
That is not even the first 13 lines of the source... View source: http://edition.cnn.com/index.html
One more question: when I try google.com as the host address
import socket
sock = socket.socket ( socket.AF_INET, socket.SOCK_STREAM )
sock.connect ( ( "google.com", 80 ) )
host = socket.gethostbyname("google.com")
sock.sendall('GET http://google.com/index.html HTTP/1.1\r\n'\
+ 'User-Agent: agent123\r\n'\
+ 'Host: '+host+'\r\n'\
+ '\r\n')
print sock.recv(4096)
sock.close()
I get this response:
HTTP/1.1 301 Moved Permanently
Location: http://www.google.com/index.html
Content-Type: text/html; charset=UTF-8
Date: Wed, 01 Jan 2014 18:38:57 GMT
Expires: Fri, 31 Jan 2014 18:38:57 GMT
Cache-Control: public, max-age=2592000
Server: gws
Content-Length: 229
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alternate-Protocol: 80:quic
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/index.html">here</A>.
</BODY></HTML>
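which says the page has moved to the same address I was trying to download...
sock.recv(4096) reads at most 4096 bytes; how much the call actually returns depends on how much data has already arrived. There is no guarantee that 4096 bytes will be read in one go.
You have to keep reading from the socket until all the data has been received: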
data = ''
chunk = sock.recv(4096)
while chunk:  # recv() returns '' once the server has closed the connection
    data += chunk
    chunk = sock.recv(4096)
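The same read loop can be wrapped in a small helper. This is a minimal Python 3 sketch, not the answer's own code: the name recv_all is hypothetical, and it assumes the server closes the connection when it has finished sending (for example because the request carried Connection: close). On a keep-alive connection the loop would block waiting for more data.

```python
import socket


def recv_all(sock, bufsize=4096):
    """Read from sock until the peer closes the connection.

    recv_all is a hypothetical helper name, not part of the socket module.
    """
    chunks = []
    while True:
        chunk = sock.recv(bufsize)  # returns at most bufsize bytes
        if not chunk:               # b'' means the peer closed its end
            break
        chunks.append(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    # Local demonstration with a connected socket pair; no network needed.
    left, right = socket.socketpair()
    left.sendall(b"x" * 5000)
    left.close()
    print(len(recv_all(right, 1024)))  # several recv() calls are needed
    right.close()
```

Collecting the chunks in a list and joining once avoids repeatedly copying a growing bytes object.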
Your request for http://google.com/index.html is redirected to a different host name, www.google.com. Adjust your request accordingly.
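As a rough illustration of an adjusted request, the Python 3 sketch below builds the request bytes with the host name in the Host header. The question's code puts the IP address returned by gethostbyname there, whereas the Host header is meant to carry the name. The helper name build_get_request is hypothetical, and the Connection: close header is an added assumption so that the server closes the connection after responding.

```python
def build_get_request(host, path="/"):
    """Build a minimal HTTP/1.1 GET request as bytes.

    build_get_request is a hypothetical helper for illustration.
    """
    lines = [
        "GET {} HTTP/1.1".format(path),
        "Host: {}".format(host),   # the host name, not its IP address
        "User-Agent: agent123",
        "Connection: close",       # assumption: ask the server to close when done
        "",                        # the two empty items produce the blank
        "",                        # line that terminates the headers
    ]
    return "\r\n".join(lines).encode("ascii")


print(build_get_request("www.google.com", "/index.html").decode("ascii"))
```

The resulting bytes can be passed straight to sock.sendall() on a socket connected to port 80 of the same host.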
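If you want to implement a full HTTP client, you have to parse the status line and handle 301 redirect responses by parsing the Location: header and opening a new connection to request the new URL you were given.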
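The parsing step described above could look roughly like this in Python 3. It is a sketch under simplifying assumptions: parse_head is a hypothetical helper, the whole response head is already in memory, and repeated headers simply overwrite each other.

```python
from urllib.parse import urlsplit


def parse_head(raw):
    """Split a raw HTTP response into (status_code, headers, body).

    parse_head is a hypothetical helper; header names are lower-cased.
    """
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # The status line looks like: HTTP/1.1 301 Moved Permanently
    status_code = int(lines[0].split(" ", 2)[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


sample = (b"HTTP/1.1 301 Moved Permanently\r\n"
          b"Location: http://www.google.com/index.html\r\n"
          b"Content-Length: 0\r\n"
          b"\r\n")
status, headers, body = parse_head(sample)
if status in (301, 302, 303, 307, 308) and "location" in headers:
    target = urlsplit(headers["location"])
    # Connect to target.hostname and request target.path next.
    print("redirect to", target.hostname, target.path)
```

A real client would also cap the number of redirects it follows so that a redirect loop cannot run forever.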
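In the examples below, the request to edition.cnn.com uses HTTP/1.0, while the request to www.google.com uses HTTP/1.1. Maybe someone can explain how to tell which one to use.
Works for: www.google.com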
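One practical difference between the two versions shows up in the response quoted in the question: it carries Transfer-Encoding: chunked, and the ac2a before the HTML is the hexadecimal size of the first chunk (0xac2a = 44074 bytes). Chunked encoding is an HTTP/1.1 feature and is not used in replies to HTTP/1.0 requests. The Python 3 sketch below shows how such a body could be decoded once it has been read completely; decode_chunked is a hypothetical helper, and trailers after the last chunk are ignored.

```python
def decode_chunked(body):
    """Decode an HTTP/1.1 chunked body (bytes) into the plain payload.

    decode_chunked is a hypothetical helper; trailers are ignored.
    """
    payload = []
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The chunk size is hexadecimal; drop an optional ";extension" part.
        size_text = body[pos:line_end].split(b";")[0].strip()
        size = int(size_text.decode("ascii"), 16)
        if size == 0:            # a zero-sized chunk ends the body
            break
        start = line_end + 2
        payload.append(body[start:start + size])
        pos = start + size + 2   # skip the chunk data and its trailing CRLF
    return b"".join(payload)


print(decode_chunked(b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"))
# prints b'hello world'
```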
import socket
import time

domain = 'www.google.com'
# must specify index.html for google
full_url = 'http://www.google.com/index.html'

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((domain, 80))
# HTTP lines end with CRLF, and HTTP/1.1 requires a Host header
mysock.sendall('GET ' + full_url + ' HTTP/1.1\r\n'
               'Host: ' + domain + '\r\n\r\n')
while True:
    data = mysock.recv(512)
    time.sleep(2.0)  # 2 second delay
    if len(data) < 1:  # empty read: the server closed the connection
        break
    print data
mysock.close()
Works for: edition.cnn.com
Warning: large output; consider raising recv(512) to a larger number or changing time.sleep(2.0) to 1 second.
import socket
import time

domain = 'cnn.com'
full_url = 'http://edition.cnn.com/'

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((domain, 80))
# HTTP lines end with CRLF; a blank line terminates the request
mysock.sendall('GET ' + full_url + ' HTTP/1.0\r\n\r\n')
while True:
    data = mysock.recv(512)
    time.sleep(2.0)  # 2 second delay
    if len(data) < 1:  # empty read: the server closed the connection
        break
    print data
mysock.close()
Both processes finished with exit code 0.