Saving Image from URL using Python Requests - URL type error

Using the following code:

    with open('newim','wb') as f:
        f.write(requests.get(repr(url)))

where the url is:

    url = ''

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python33\lib\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python33\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python33\lib\site-packages\requests\sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python33\lib\site-packages\requests\sessions.py", line 567, in send
    adapter = self.get_adapter(url=request.url)
  File "C:\Python33\lib\site-packages\requests\sessions.py", line 641, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)

I have seen other posts with what, at first glance, appears to be a similar problem, but I haven't had any luck just adding 'https://' or anything like that. I seriously want to avoid having to do this in webdriver+AutoIt or something similar, because I have to do the same exercise for thousands of images.

This is an image encoded in base64. Quoting the URL below: "base64 equals to text (string) representation of the image itself".

Read this for a detailed explanation: http://www.stoimen.com/blog/2009/04/23/when-you-should-use-base64-for-images/

In order to use it you'll have to run the data through a base64 decoder. Luckily SO already provides you with the answer on how to do it:

Python base64 data decode
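
As a minimal sketch of that decoding step, the standard-library base64 module can turn the payload of a data URI back into bytes. The helper name data_uri_to_bytes and the sample payload (which encodes the bytes b'hello', not a real image) are made up for illustration; the url from the question is not reproduced here.

```python
import base64

def data_uri_to_bytes(data_uri):
    """Return (subtype, raw_bytes) for a base64-encoded 'data:image/...' URI."""
    header, sep, payload = data_uri.partition(',')
    if (not sep or not header.startswith('data:image/')
            or not header.endswith(';base64')):
        raise ValueError('not a base64-encoded image data URI')
    # 'data:image/png;base64' -> 'png'
    subtype = header[len('data:image/'):].split(';')[0]
    return subtype, base64.b64decode(payload)

# Hypothetical sample: the payload is just b'hello', so the example stays short.
sample = 'data:image/png;base64,' + base64.b64encode(b'hello').decode('ascii')
subtype, raw = data_uri_to_bytes(sample)
```

With a real data URI, the returned bytes could then be written out with open('newim.' + subtype, 'wb'), with no HTTP request involved.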

There seems to be a problem with your understanding of the concept of embedded images. The url you have posted is, actually, what your browser returns when you select 'View Image' or 'Copy Image Location' (or something similar, depending on the browser) from the context menu, and is formally called a data URI.

It is not an http url pointing to an image, and you cannot use it to retrieve actual images from any server: this is exactly what requests points out in the error message.
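
One way to spot this before calling requests is to look at the URL scheme. The sketch below uses only the standard library. The helper name is_fetchable and the sample strings are assumptions for illustration, not part of the original code.

```python
from urllib.parse import urlsplit

def is_fetchable(url):
    """True only for URLs with an http or https scheme."""
    return urlsplit(url).scheme in ('http', 'https')

# Hypothetical inputs, chosen to mirror the cases discussed here.
page_ok = is_fetchable('http://server/dir/images.html')
data_uri_ok = is_fetchable('data:image/png;base64,aGVsbG8=')
# repr() wraps the string in quotes, so the scheme is no longer recognized.
quoted_ok = is_fetchable(repr('http://server/dir/images.html'))
```

Under these assumptions, only the first call reports a fetchable URL. The data URI has the scheme 'data', and the repr() result starts with a quote character, so neither has an http scheme for requests to work with.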


So, how do we get these images? The following script will handle this task:

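import requests
from lxml import html
import binascii as ba

i = 0
url = "<Page URL goes here>"  # e.g. http://server/dir/images.html
page = requests.get(url)             # fetch the page that embeds the images
struct = html.fromstring(page.text)  # parse the HTML into an element tree
images = struct.xpath('//img/@src')  # src attribute of every <img> element

for img in images:
    i += 1
    # image type taken from the data URI header, e.g. 'png'
    ext = img.partition('data:image/')[2].split(';')[0]
    with open('newim' + str(i) + '.' + ext, 'wb') as f:
        # everything after 'base64,' is the encoded image data
        f.write(ba.a2b_base64(img.partition('base64,')[2]))

print("Done")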

To run it you will need to install, along with requests, the lxml library.


Here follows a short description of how the script functions:

First it requests the url from the server and, after it gets the server's response, it stores it in a Response object (page).

Then it uses html.fromstring() from lxml to transform the text content of page into a tree structure which can be processed by commands using XPath syntax, like this one: images = struct.xpath('//img/@src').

The result is a list containing the contents of the src attribute of every image in the page. In this case (embedded images) these are the data URIs.
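
If installing lxml is not an option, the same list can be collected with the standard-library html.parser module instead. This is a swapped-in alternative, not what the script above does, and the class name ImgSrcCollector and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value is not None:
                    self.sources.append(value)

# Hypothetical markup; in the script this would be page.text.
markup = ('<h1>Images</h1><img src="a.png" /><br />'
          '<img src="data:image/png;base64,aGVsbG8=">')
collector = ImgSrcCollector()
collector.feed(markup)
sources = collector.sources
```

Here sources plays the role of images in the script: one string per img element, in document order.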

Then, for every image in the list, it first gets the image type (which will be used as the extension of the newim file), using partition() and split(), and stores it in ext. Then it converts the base64 encoded data to binary (using a2b_base64() from the binascii module) and writes the output to the file.
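
To make the string handling concrete, here is the same pair of expressions applied to one sample value. The sample data URI is made up (its payload decodes to b'hello', not to a real image), and the second value, a plain file name, is included only to show what the expressions yield for a src that is not a data URI.

```python
import binascii as ba

img = 'data:image/png;base64,aGVsbG8='   # hypothetical sample, not a real image

# partition() splits around the first occurrence of the separator
before, sep, after = img.partition('data:image/')
ext = after.split(';')[0]                # text up to the first ';'
payload = img.partition('base64,')[2]    # everything after 'base64,'
data = ba.a2b_base64(payload)            # base64 text -> raw bytes

# For an ordinary linked image the separator is absent, so nothing is extracted.
linked = 'pics/logo.png'
linked_ext = linked.partition('data:image/')[2].split(';')[0]
linked_payload = linked.partition('base64,')[2]
```

For the sample, ext is 'png' and data is b'hello'. For the linked file name both results are empty strings, which is why the script as written produces nothing useful for images that are not embedded.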


As a small demo, save this html code (as, e.g., images.html) somewhere on your server:

<h1>Images</h1>
<img src="" />  
<br />
<img src=""></img>
<br />
<img src=""/>

and point to it in the script: requests.get("http://yourserver/somedir/images.html").

When you run the script you will get 3 images, named newim1.png, newim2.png and newim3.jpg respectively.


As a reminder, do note that this script (in its current form) will only handle embedded images. If you also want to process ordinary linked images, then you have to modify it accordingly (but this is not difficult).
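
One possible shape for that modification is sketched below. It is not part of the original script: the helper name resolve_image and the sample URLs are assumptions. It decodes embedded images as before and, for anything else, resolves the src against the page URL so that it can be downloaded separately.

```python
import binascii
from urllib.parse import urljoin

def resolve_image(src, page_url):
    """Classify one src value from the page.

    Returns ('embedded', ext, data) for a data URI, or
    ('linked', absolute_url) for an ordinary image reference.
    """
    if src.startswith('data:image/'):
        ext = src.partition('data:image/')[2].split(';')[0]
        data = binascii.a2b_base64(src.partition('base64,')[2])
        return ('embedded', ext, data)
    # Relative paths are resolved against the page they came from.
    return ('linked', urljoin(page_url, src))

# Hypothetical inputs for illustration.
page_url = 'http://server/dir/images.html'
embedded = resolve_image('data:image/png;base64,aGVsbG8=', page_url)
linked = resolve_image('pics/logo.png', page_url)
```

For a 'linked' result, the bytes could then be fetched with requests.get(absolute_url).content and written to a file in the same way as the decoded data.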
