
Python Sockets Download jpg over HTTP

I am witnessing very strange behavior from my Python script. I am using Python sockets to download an image from the web; I am not interested in using requests/urllib. When I try to download the image, the download succeeds. However, when I open the file in the Photos app, Windows reports an "It looks like we don't support this file format" error.

This is where the strange part starts. If I copy and paste the URL that my socket is reaching out to (the one used to download the image, in this case http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg ), download it myself from Chrome, and then run my script again, the image downloads and displays with no problem! The Content-Length value in the HTTP response headers also increases. I have done this 3 times with 3 different images and have seen the same behavior each time. Below are two runs of my script, one before I downloaded the file from Chrome and one after. Notice that in the first run the Content-Length header states that there are 2564 bytes in the body of the response. In the second run, this number changes to 3833. Both runs request the same URL.

PS D:\Documents\School\RIT\Classes\Summer 2018\CSEC 380\Homework\3\Script> python .\hw3-script.py
MESSAGE SENT
GET /gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//xAbuaitah.jpg.pagespeed.ic.PFwk87Pcno.jpg HTTP/1.1
Host: www.rit.edu
Accept: image/webp,image/apng,image/*,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate


ENTIRE MESSAGE RECEIVED
b'HTTP/1.1 200 OK\r\nDate: Sun, 12 Aug 2018 04:58:24 GMT\r\nServer: Apache\r\nLink: <http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg>; rel="canonical"\r\nAccept-Ranges: bytes\r\nLast-Modified: Sun, 12 Aug 2018 02:06:23 GMT\r\nX-Original-Content-Length: 25378\r\nX-Content-Type-Options: nosniff\r\nExpires: Sun, 12 Aug 2018 02:11:23 GMT\r\nCache-Control: max-age=300,private\r\nContent-Length: 2564\r\nConnection: close\r\nContent-Type: image/webp\r\n\r\nRIFF\xfc\t\...<hex data here>...\x00\x00'

RESPONSE HEADERS SPLIT OFF
HTTP/1.1 200 OK
Date: Sun, 12 Aug 2018 04:58:24 GMT
Server: Apache
Link: <http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg>; rel="canonical"
Accept-Ranges: bytes
Last-Modified: Sun, 12 Aug 2018 02:06:23 GMT
X-Original-Content-Length: 25378
X-Content-Type-Options: nosniff
Expires: Sun, 12 Aug 2018 02:11:23 GMT
Cache-Control: max-age=300,private
Content-Length: 2564
Connection: close
Content-Type: image/webp

IMAGE BINARY DATA SPLIT OFF
b'RIFF\xfc\t\...<hex data here>...\x00\x00'

Bytes in image data: 2581

PS D:\Documents\School\RIT\Classes\Summer 2018\CSEC 380\Homework\3\Script> python .\hw3-script.py
MESSAGE SENT
GET /gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//xAbuaitah.jpg.pagespeed.ic.PFwk87Pcno.jpg HTTP/1.1
Host: www.rit.edu
Accept: image/webp,image/apng,image/*,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate


ENTIRE MESSAGE RECEIVED
b'HTTP/1.1 200 OK\r\nDate: Sun, 12 Aug 2018 04:59:08 GMT\r\nServer: Apache\r\nLink: <http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg>; rel="canonical"\r\nX-Content-Type-Options: nosniff\r\nAccept-Ranges: bytes\r\nExpires: Mon, 12 Aug 2019 04:58:50 GMT\r\nCache-Control: max-age=31536000\r\nEtag: W/"0"\r\nLast-Modified: Sun, 12 Aug 2018 04:58:50 GMT\r\nX-Original-Content-Length: 25378\r\nContent-Length: 3833\r\nConnection: close\r\nContent-Type: image/jpeg\r\n\r\n\xff\xd8\...<hex data here>...\xff\xd9'

RESPONSE HEADERS SPLIT OFF
HTTP/1.1 200 OK
Date: Sun, 12 Aug 2018 04:59:08 GMT
Server: Apache
Link: <http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg>; rel="canonical"
X-Content-Type-Options: nosniff
Accept-Ranges: bytes
Expires: Mon, 12 Aug 2019 04:58:50 GMT
Cache-Control: max-age=31536000
Etag: W/"0"
Last-Modified: Sun, 12 Aug 2018 04:58:50 GMT
X-Original-Content-Length: 25378
Content-Length: 3833
Connection: close
Content-Type: image/jpeg

IMAGE BINARY DATA SPLIT OFF
b'\xff\xd8\...<hex data here>...\xff\xd9'

Bytes in image data: 3850

Here is my code:

import os
import socket
import sys

class MySocket:

    def __init__(self, sock=None):
        if sock is None:
            self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        else:
            self.sock = sock

    def connect(self, host, port):
        self.sock.connect((host, port))

    def myclose(self):
        self.sock.close()

    def mysend(self, msg, debug=False):
        if debug:
            print("MESSAGE SENT")
            print(msg.decode())
        self.sock.sendall(msg)

    def myreceive(self, debug=False):
        received = b''
        buffer = 1
        while True:
            part = self.sock.recv(buffer)
            received += part
            if part == b'':
                break
        if debug:
            print("Received...")
            print(received)
        return received

def download_image(img_url):
    """
    Download images with the given socket and list of urls
    :param img_url: url corresponding to an image
    :return: None
    """
    image_socket = MySocket()
    image_socket.connect("www.rit.edu", 80)
    message = "GET " + img_url + " HTTP/1.1\r\n" \
              "Host: www.rit.edu\r\n" \
              "Accept: image/webp,image/apng,image/*,*/*;q=0.8\r\n" \
              "Accept-Language: en-US,en;q=0.9\r\n" \
              "Accept-Encoding: gzip, deflate\r\n\r\n"
    image_socket.mysend(message.encode(), True)
    reply = image_socket.myreceive()
    print("ENTIRE MESSAGE RECEIVED")
    print(reply)
    print()
    headers = reply.split(b'\r\n\r\n')[0]

    print("RESPONSE HEADERS SPLIT OFF")
    print(headers.decode())
    image = reply[len(headers)+4:]
    print()

    print("IMAGE BINARY DATA SPLIT OFF")
    print(image)
    print()
    print("Bytes in image data:", sys.getsizeof(image))
    print()
    # print(type(image))
    img_name = str(len(os.listdir("D:\\Documents\\School\\RIT\\Classes\\Summer 2018\\CSEC 380\\Homework\\3\\Script\\act1step2images"))) + img_url[-4:]
    f = open(os.path.join("D:\\Documents\\School\\RIT\\Classes\\Summer 2018\\CSEC 380\\Homework\\3\\Script\\act1step2images", img_name), 'wb')
    f.write(image)
    f.close()

def main():
    download_image("http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg")

main()

Can anyone tell me what is going on and why the jpg does not download on the first try?

This is part of the request you sent:

Accept: image/webp,image/apng,image/*,*/*;q=0.8

It tells the server that you accept image/webp, which you list ahead of any other image/* type. And thus you get a WEBP image in your response:

HTTP/1.1 200 OK
...
Content-Length: 2564
...
Content-Type: image/webp
...
b'RIFF\xfc\t\...<hex data here>...\x00\x00'

The next time you send the same request, you get a different response instead:

HTTP/1.1 200 OK
...
Content-Length: 3833
...
Content-Type: image/jpeg
...
b'\xff\xd8\...<hex data here>...\xff\xd9'

This time you don't get a WEBP image back but a JPEG image, which can be seen both in the Content-Type header and in the response body.
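The difference in the response bodies can be checked in code rather than by eye. The following is a minimal sketch, not part of the original script: `sniff_image_format` is a hypothetical helper that inspects the leading magic bytes of the body, which is how the two responses above can be told apart (`RIFF` followed later by `WEBP` for WebP, `\xff\xd8` for JPEG).

```python
def sniff_image_format(body: bytes) -> str:
    """Guess the image format from the first bytes of a response body.

    Hypothetical helper for illustration; it only knows a few formats.
    """
    # JPEG data starts with the SOI marker FF D8.
    if body[:2] == b"\xff\xd8":
        return "jpeg"
    # WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if body[:4] == b"RIFF" and body[8:12] == b"WEBP":
        return "webp"
    # PNG has a fixed 8-byte signature.
    if body[:8] == b"\x89PNG\r\n\x1a\n":
        return "png"
    return "unknown"
```

Calling this on the body before writing the file would let the script pick a matching file extension, or refuse to save a format it did not expect, instead of saving WebP data under a .jpg name.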

I'm not completely sure why this is the case, but I assume that the previous request from Chrome made the server create the JPEG image from the original source file and cache it locally for later requests, so that it is now cheaper for the server to serve the pre-created JPEG file than to create a new WEBP file. And your Accept header stated that you support both formats.

Anyway, if your code does not support WEBP but only JPEG, then you should not claim to be able to deal with WEBP in your Accept header. Instead you should only claim what you really support, i.e.

Accept: image/jpeg

The same is true of the other information you send in the request. For example, you claim to support compressed responses by sending Accept-Encoding: gzip, deflate, but your code has no support for dealing with a compressed response. Similarly, you claim to be able to deal with chunked transfer encoding and HTTP keep-alive by sending an HTTP/1.1 request, but your code has no support for either of these features.
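If you did want to keep advertising compression, the client would have to undo it. Here is a rough sketch of what that could look like, assuming the whole response has already been read and split into a header block and a body. `parse_headers` and `decode_body` are hypothetical names, and chunked transfer encoding is deliberately rejected rather than handled.

```python
import gzip
import zlib


def parse_headers(head: bytes) -> dict:
    """Turn a raw header block (status line first) into a lowercase-keyed dict."""
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        headers[name.strip().lower().decode("latin-1")] = value.strip().decode("latin-1")
    return headers


def decode_body(headers: dict, body: bytes) -> bytes:
    """Undo Content-Encoding; refuse transfer encodings this sketch cannot handle."""
    if "chunked" in headers.get("transfer-encoding", "").lower():
        raise ValueError("chunked transfer encoding is not handled in this sketch")
    encoding = headers.get("content-encoding", "identity").lower()
    if encoding == "identity":
        return body
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            return zlib.decompress(body)  # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(body, -zlib.MAX_WBITS)  # raw deflate fallback
    raise ValueError("unsupported Content-Encoding: " + encoding)
```

The simpler route, as described above, is to not send Accept-Encoding at all, so that the server has no reason to compress the body.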

In summary, you should probably send only this request to get what you want:

GET /.... HTTP/1.0
Host: www.rit.edu
Accept: image/jpeg
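As a sketch of how the script could send that reduced request, the following builds the three-line HTTP/1.0 request and splits the reply at the first blank line. The function names (`build_request`, `split_response`, `fetch`) are made up for illustration, and the path passed in is assumed to be the path component of the URL, not the full URL.

```python
import socket


def build_request(host: str, path: str) -> bytes:
    """Build a minimal HTTP/1.0 GET that only claims JPEG support."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "Accept: image/jpeg",
        "",  # blank line terminates the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def split_response(reply: bytes):
    """Split a raw reply into (header block, body) at the first blank line."""
    head, sep, body = reply.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header terminator found in reply")
    return head, body


def fetch(host: str, path: str, port: int = 80) -> bytes:
    """Send the request and read until the server closes the connection."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            part = sock.recv(4096)
            if not part:  # empty read: server closed the connection
                break
            chunks.append(part)
    return b"".join(chunks)
```

Reading until the connection closes works here because an HTTP/1.0 request without keep-alive lets the server end the response by closing the socket. Reading 4096 bytes at a time instead of 1 changes only the speed, not the result.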


 