Python - Getting source code using sockets

Question

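I want to send an HTTP GET request and receive the source code of a web page, and this has to be done with sockets. I set the buffer size to 4096, but my script downloads only a small part of the page.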

import socket
sock = socket.socket ( socket.AF_INET, socket.SOCK_STREAM )
sock.connect ( ( "edition.cnn.com", 80 ) )

host = socket.gethostbyname("edition.cnn.com")
sock.sendall('GET http://edition.cnn.com/index.html HTTP/1.1\r\n'\
    + 'User-Agent: agent123\r\n'\
    + 'Host: '+host+'\r\n'\
    + '\r\n')

print sock.recv(4096)
sock.close()

After running this code, the data I get is:

HTTP/1.1 200 OK
Server: nginx
Date: Wed, 01 Jan 2014 18:31:25 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: CG=GR:44:Réthimnon; path=/
Last-Modified: Wed, 01 Jan 2014 18:31:22 GMT
Vary: Accept-Encoding
Cache-Control: max-age=60, private
Expires: Wed, 01 Jan 2014 18:32:25 GMT

ac2a

<!DOCTYPE HTML>
<html lang="en-US">
<head>
<title>CNN.com International - Breaking, World, Business, Sports, Entertainment and Video News</title>
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<meta http-equiv="content-type" content="text/html;charset=utf-8"/>
<meta http-equiv="last-modified" content="2014-01-01T18:28:34Z"/>
<meta http-equiv="refresh" content="1800;url=http://edition.cnn.com/?refresh=1"/>
<meta name="robots" content="index,follow"/>
<meta name="googlebot" content="noarchive"/>
<meta name="description" content="CNN.com International delivers breaking news from across the globe and information on the latest top stories, business, sports and entertainment headlines. Follow the news as it happens through: special reports, videos, audio, photo galleries plus interactive maps and timelines."/>
<meta name="keywords" content="CNN, CNN news, CNN International, CNN International news, CNN Edition, Edition news, news, news online, breaking news, U.S. news, world news, global news, weather, business, CNN Money, sports, politics, law, technology, entertainment, education,

That is not even the first 13 lines of the source... View the source at: http://edition.cnn.com/index.html

One more problem: when I try google.com as the host address

import socket
sock = socket.socket ( socket.AF_INET, socket.SOCK_STREAM )
sock.connect ( ( "google.com", 80 ) )

host = socket.gethostbyname("google.com")
sock.sendall('GET http://google.com/index.html HTTP/1.1\r\n'\
    + 'User-Agent: agent123\r\n'\
    + 'Host: '+host+'\r\n'\
    + '\r\n')
print sock.recv(4096)
sock.close()

I get this response:

HTTP/1.1 301 Moved Permanently
Location: http://www.google.com/index.html
Content-Type: text/html; charset=UTF-8
Date: Wed, 01 Jan 2014 18:38:57 GMT
Expires: Fri, 31 Jan 2014 18:38:57 GMT
Cache-Control: public, max-age=2592000
Server: gws
Content-Length: 229
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alternate-Protocol: 80:quic

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/index.html">here</A>.

</BODY></HTML>

It says the page has moved to the same address I am trying to download...

Answer 1

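sock.recv(4096) reads at most 4096 bytes; how much the call actually returns depends on how much data has already arrived. There is no guarantee that a single call will give you 4096 bytes.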

You have to keep reading from the socket until all of the data has been received:

data = ''
chunk = sock.recv(4096)
# recv returns an empty string once the server has closed the connection
while chunk:
    data += chunk
    chunk = sock.recv(4096)
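The same loop can be wrapped in a small function. This is only a sketch, written for Python 3 (where sockets carry bytes rather than the Python 2 strings used elsewhere in this thread), and recv_all is a made-up helper name. It assumes the peer closes the connection when it has finished sending; for HTTP that usually means asking for it with a `Connection: close` request header, because on a keep-alive connection the last recv call would block instead of returning an empty result.

```python
import socket


def recv_all(sock, bufsize=4096):
    """Read from sock until the peer closes the connection.

    recv_all is a hypothetical helper name, not part of the socket module.
    """
    chunks = []
    while True:
        chunk = sock.recv(bufsize)  # returns at most bufsize bytes
        if not chunk:               # b'' means the peer closed its side
            break
        chunks.append(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    # Local demonstration with a connected socket pair, no network needed.
    left, right = socket.socketpair()
    payload = b"x" * 5000           # more than a single 4096-byte recv
    left.sendall(payload)
    left.close()                    # signals end of data to the reader
    data = recv_all(right)
    right.close()
    print(len(data))
```

Collecting the chunks in a list and joining once at the end avoids repeatedly copying a growing buffer, which matters for large pages.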

Your request for http://google.com/index.html is being redirected to a different hostname, www.google.com. Adjust your request accordingly.
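One way to adjust the request, sketched for Python 3: connect to www.google.com and send that same hostname (not the resolved IP address) in the Host header, with a plain path on the request line. build_get_request is a hypothetical helper, and the `Connection: close` header is an added assumption so that the server closes the connection after the response.

```python
def build_get_request(hostname, path="/", user_agent="agent123"):
    """Return the bytes of a simple HTTP/1.1 GET request.

    build_get_request is an illustrative name, not a library function.
    """
    lines = [
        "GET {} HTTP/1.1".format(path),
        "Host: {}".format(hostname),        # hostname, not the IP address
        "User-Agent: {}".format(user_agent),
        "Connection: close",                # assumption: one response per connection
        "",                                 # the two empty items produce the
        "",                                 # blank line that ends the headers
    ]
    return "\r\n".join(lines).encode("ascii")


request = build_get_request("www.google.com", "/index.html")
print(request.decode("ascii"))
```

The resulting bytes can be passed to sock.sendall() after connecting to ("www.google.com", 80).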

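If you want to implement a full HTTP client, you will have to parse the status line and handle 301 redirect responses by parsing the Location: header and opening a new connection to request the new URL you were given.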
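A rough sketch of that parsing step in Python 3, using the 301 response from the question as sample input. The helper names parse_head and redirect_target are invented for illustration. The sketch only looks at 301, ignores any query string in the Location URL, and assumes the complete header block has already been received.

```python
from urllib.parse import urlsplit


def parse_head(raw):
    """Split raw response bytes into (status_code, headers dict)."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    status_line, *header_lines = head.split("\r\n")
    # The status line looks like: HTTP/1.1 301 Moved Permanently
    status_code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers


def redirect_target(raw):
    """Return (host, path) to request next, or None if this is not a 301."""
    status_code, headers = parse_head(raw)
    if status_code != 301 or "location" not in headers:
        return None
    parts = urlsplit(headers["location"])
    return parts.hostname, parts.path or "/"   # query string ignored for brevity


sample = (b"HTTP/1.1 301 Moved Permanently\r\n"
          b"Location: http://www.google.com/index.html\r\n"
          b"Content-Length: 229\r\n"
          b"\r\n")
print(redirect_target(sample))   # ('www.google.com', '/index.html')
```

The returned host is the one to connect to and to put in the Host header of the follow-up request.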

Answer 2

edition.cnn.com uses HTTP/1.0, while www.google.com uses HTTP/1.1. Perhaps someone can shed some light on how to tell which one is in use.
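On telling which version is in play: the first token of a response's status line is the HTTP version the server answered with, so it can be read straight from the received data. A minimal Python 3 sketch, where http_version is a hypothetical helper name:

```python
def http_version(raw):
    """Return the version token from the status line of a raw response."""
    status_line = raw.split(b"\r\n", 1)[0].decode("iso-8859-1")
    return status_line.split(" ", 1)[0]


print(http_version(b"HTTP/1.1 200 OK\r\nServer: nginx\r\n\r\n"))   # HTTP/1.1
```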

Works for: www.google.com

import socket
import time

domain = 'www.google.com'
# must specify index.html for google
full_url = 'http://www.google.com/index.html'


mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((domain, 80))
# HTTP lines end with CRLF, and HTTP/1.1 requires a Host header
mysock.send('GET ' + full_url + ' HTTP/1.1\r\nHost: ' + domain + '\r\n\r\n')

while True:
    data = mysock.recv(512)
    time.sleep(2.0)     # 2 second delay
    if len(data) < 1:
        break
    print data

mysock.close()

Works for: edition.cnn.com

Warning: large output; consider raising recv(512) to a larger number or changing time.sleep(2.0) to 1 second.

import socket
import time

domain = 'cnn.com'
full_url = 'http://edition.cnn.com/'

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((domain, 80))
# HTTP lines end with CRLF
mysock.send('GET ' + full_url + ' HTTP/1.0\r\n\r\n')

while True:
    data = mysock.recv(512)
    time.sleep(2.0)     # 2 second delay
    if len(data) < 1:
        break
    print data

mysock.close()

Both processes finished with exit code 0.

Answer 1 3 2014-01-01 18:47:41

Answer 2 0 2015-11-03 22:17:40
