Scrape a JPG file from a webpage, then save it using Python

OK, I'm trying to scrape a JPG image from the Gucci website. Take this one as an example.

http://www.gucci.com/images/ecommerce/styles_new/201501/web_full/277520_F4CYG_4080_001_web_full_new_theme.jpg

I tried urllib.urlretrieve, which doesn't work because Gucci blocks it. So I wanted to use requests to scrape the source of the image and then write it into a .jpg file.

image = requests.get("http://www.gucci.com/images/ecommerce/styles_new/201501/web_full/277520_F4CYG_4080_001_web_full_new_theme.jpg").text.encode('utf-8')

I encoded it because if I don't, it keeps telling me that gbk cannot encode the string.

Then:

with open('1.jpg', 'wb') as f:
    f.write(image)

Looks good, right? But the result is that the jpg file cannot be opened. There's no image! Windows tells me the jpg file is damaged.

What could be the problem?

  1. I'm thinking that maybe when I scraped the image, I lost some information, or some characters were scraped incorrectly. But how can I find out which?

  2. I'm thinking that maybe some information is lost via encoding. But if I don't encode, I cannot even print it, not to mention writing it into a file.

What could go wrong?

I am not sure what the purpose of encode is here. You're not working with text, you're working with an image. You need to access the response as binary data, not as text, and use image manipulation functions rather than text ones. Try this:

from PIL import Image
from io import BytesIO
import requests

response = requests.get("http://www.gucci.com/images/ecommerce/styles_new/201501/web_full/277520_F4CYG_4080_001_web_full_new_theme.jpg")
buffer = BytesIO(response.content)  # wrap the raw bytes in a file-like object (avoids shadowing the built-in bytes)
image = Image.open(buffer)
image.save("1.jpg")

Note the use of response.content instead of response.text. You will need to have PIL or Pillow installed to use the Image module. BytesIO is included in Python 3.

Or you can just save the data straight to disk without looking at what's inside:

import requests
response = requests.get("http://www.gucci.com/images/ecommerce/styles_new/201501/web_full/277520_F4CYG_4080_001_web_full_new_theme.jpg")
with open('1.jpg','wb') as f:
    f.write(response.content)
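
To see why the .text.encode('utf-8') approach from the question damages the file, here is a small offline sketch. It uses a few hand-written bytes as a stand-in for the start of a JPEG, and it assumes latin-1 as the text encoding that gets guessed; requests may guess a different encoding for a real response, but the effect is the same kind of corruption.

```python
# Stand-in for the first bytes of a JPEG file (real files start with FF D8 FF).
raw = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10])

# Roughly what response.text does: decode the bytes with a guessed text encoding.
# latin-1 is an assumption made for this demo only.
as_text = raw.decode("latin-1")

# What the question's code did next: re-encode that text as UTF-8.
round_tripped = as_text.encode("utf-8")

print(raw.hex())             # ffd8ffe00010
print(round_tripped.hex())   # c3bfc398c3bfc3a00010
print(round_tripped == raw)  # False
```

Every byte above 0x7F turns into two bytes after the round trip, so the file no longer starts with the JPEG signature and image viewers reject it. Writing response.content skips the decode/encode steps entirely.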

A JPEG file is not text, it's binary data. So you need to use the .content attribute of the response to access it.

The code below also includes a get_headers() function, which can be handy when you're exploring a website.

import requests

def get_headers(url):
    resp = requests.head(url)
    print("Status: %d" % resp.status_code)
    resp.raise_for_status()
    for t in resp.headers.items():
        print('%-16s : %s' % t)

def download(url, fname):
    ''' Download url to fname '''
    print("Downloading '%s' to '%s'" % (url, fname))
    resp = requests.get(url)
    resp.raise_for_status()
    with open(fname, 'wb') as f:
        f.write(resp.content)

def main():
    site = 'http://www.gucci.com/images/ecommerce/styles_new/201501/web_full/'
    basename = '277520_F4CYG_4080_001_web_full_new_theme.jpg'
    url = site + basename
    fname = 'qtest.jpg'

    try:
        #get_headers(url)
        download(url, fname)
    except requests.exceptions.HTTPError as e:
        print("%s '%s'" % (e, url))

if __name__ == '__main__':
    main()

We call the .raise_for_status() method so that get_headers() and download() raise an exception if something goes wrong; we catch the exception in main() and print the relevant info.
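
raise_for_status() only reports HTTP error codes, so it will not notice a server that answers 200 with something other than an image (an HTML block page, for example). As a rough complement, the sketch below checks the JPEG start and end markers in the downloaded bytes. The helper name looks_like_jpeg and its check_end parameter are made up for illustration, and the check is a heuristic, not a full validation of the file.

```python
JPEG_SOI = b"\xff\xd8\xff"  # start-of-image marker plus the lead byte of the next marker
JPEG_EOI = b"\xff\xd9"      # end-of-image marker

def looks_like_jpeg(data, check_end=True):
    """Heuristic check that data (bytes) starts, and optionally ends, like a JPEG.

    Hypothetical helper for illustration; it does not decode the image.
    """
    if not data.startswith(JPEG_SOI):
        return False
    if check_end and not data.endswith(JPEG_EOI):
        return False
    return True

# A tiny fake payload with the right markers, and two payloads that should fail.
fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 4 + b"\xff\xd9"
html_page = b"<html><body>Access denied</body></html>"
mangled = fake_jpeg.decode("latin-1").encode("utf-8")  # the question's mistake

print(looks_like_jpeg(fake_jpeg))  # True
print(looks_like_jpeg(html_page))  # False
print(looks_like_jpeg(mangled))    # False
```

In download(), one could call looks_like_jpeg(resp.content) before opening the output file and raise or log when it returns False. Some valid JPEG files carry extra bytes after the end marker, which is why the end check can be switched off with check_end=False.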
