
Downloading images using Python and BeautifulSoup

I'm trying to download images using the code below and got an error:

from bs4 import BeautifulSoup 
import requests
import re 
import urllib 
import urllib.request as ur 
import os 
import http.cookiejar as cookielib 
import json

def get_soup(url,header):
    return BeautifulSoup(ur.urlopen(ur.Request(url,headers=header)),'html.parser')


query = 'apple'   #you can change the query for the image  here
image_type="ActiOn"
query= query.split()
query='+'.join(query)
url="https://www.google.co.in/search?q="+query+"&source=lnms&tbm=isch"
print (url)
#add the directory for your image here 
DIR="/Users/jashuvadoma/Desktop/hacking/images"
header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"} 
soup = get_soup(url,header)


ActualImages=[] # contains the link for Large original images, type of image 
for a in soup.find_all("div",{"class":"rg_meta"}):
    link , Type =json.loads(a.text)["ou"], json.loads(a.text)["ity"]
    ActualImages.append((link,Type))
print ("there are total" , len(ActualImages),"images")

if not os.path.exists(DIR):
    os.mkdir(DIR)
DIR = os.path.join(DIR, query.split()[0])

if not os.path.exists(DIR):
    os.mkdir(DIR)
###print images 
for i , (img , Type) in enumerate( ActualImages):
    try:
        req = ur.Request(img, headers={'User-Agent' : header})
        raw_img = ur.urlopen(req).read()
        cntr = len([i for i in os.listdir(DIR) if image_type in i]) + 1
        print (cntr)
        if len(Type)==0:
            f = open(os.path.join(DIR , image_type + "_"+ str(cntr)+".jpg"), 'wb')
        else :
            f = open(os.path.join(DIR , image_type + "_"+ str(cntr)+"."+Type), 'wb')
        f.write(raw_img)
        f.close()
    except Exception as e:
        print ("could not load : "+img)
        print (e)

Error as follows:

https://www.google.co.in/search?q=apple&source=lnms&tbm=isch
there are total 100 images
could not load : https://www.apple.com/ac/structured-data/images/knowledge_graph_logo.png?201606271147
expected string or bytes-like object

The error clearly indicates that some parameter needs to be a string (or bytes) value, but something else was passed.

Before posting a question, you should try to debug it yourself. A few things you can try:

  1. Don't catch overly broad exceptions. The bare "except Exception" in the download loop prints only the message and hides where the error was raised. With narrower exception handling, the full traceback is easy to get:
/usr/lib/python3.6/http/client.py in _send_request(self, method, url, body, headers, encode_chunked)
   1278 
   1279         for hdr, value in headers.items():
-> 1280             self.putheader(hdr, value)
   1281         if isinstance(body, str):
   1282             # RFC 2616 Section 3.7.1 says that text default has a

/usr/lib/python3.6/http/client.py in putheader(self, header, *values)
   1214                 values[i] = str(one_value).encode('ascii')
   1215 
-> 1216             if _is_illegal_header_value(values[i]):
   1217                 raise ValueError('Invalid header value %r' % (values[i],))
   1218 

TypeError: expected string or bytes-like object

Looking at the trace, the failure happens while the request headers are being sent, so a header value seems to be wrong.

  2. Add proper log statements. Logging the header value showed that it was a dict rather than a string.
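
The two debugging steps above can be sketched roughly as follows. This is only an illustration and not part of the original script: the helper names fetch and non_string_headers, the logger name, the shortened User-Agent string and the example.com URL are all assumptions made for the example. Building a Request does not open a connection, so the demo at the bottom runs offline.

```python
import logging
import urllib.error
import urllib.request as ur

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("image_downloader")  # hypothetical logger name


def non_string_headers(req):
    """Return the names of headers whose value is not a str or bytes."""
    return [name for name, value in req.header_items()
            if not isinstance(value, (str, bytes))]


def fetch(url, headers, opener=ur.urlopen):
    """Download url and return its bytes, or None on a network error.

    Only urllib.error.URLError is caught, so programming mistakes such as
    a TypeError propagate with their full traceback instead of being hidden.
    """
    req = ur.Request(url, headers=headers)
    # Step 2: log the type of every header value before sending anything.
    for name, value in req.header_items():
        log.debug("header %r has a value of type %s", name, type(value).__name__)
    try:
        return opener(req).read()
    except urllib.error.URLError:
        # log.exception records the message together with the traceback.
        log.exception("could not load: %s", url)
        return None


# Offline demo: reproduce the mistake from the question and detect it.
header = {'User-Agent': "Mozilla/5.0"}  # shortened placeholder value
bad_req = ur.Request("https://example.com/image.png",
                     headers={'User-Agent': header})
print(non_string_headers(bad_req))  # lists the header holding the dict
```

With this structure, a dict passed as a header value shows up in the debug log before the request is sent, and the TypeError from http.client is no longer swallowed by a catch-all handler.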

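In the print images section, headers={'User-Agent' : header} passes the whole header dict as the value of the User-Agent header. Since header is already a dict that maps header names to strings, pass it directly: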

req = ur.Request(img, headers=header)
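
For context, here is a hedged sketch of how the corrected request could fit into the download step. The helper names build_request and save_image are hypothetical, the User-Agent string is shortened, and the example.com URL and the fake image bytes are placeholders. The file-naming scheme simply mirrors the question's code, and the demo at the bottom does not touch the network.

```python
import os
import tempfile
import urllib.request as ur

header = {'User-Agent': "Mozilla/5.0"}  # shortened placeholder value


def build_request(img_url, header):
    """Pass the header dict itself; each of its values is already a string."""
    return ur.Request(img_url, headers=header)


def save_image(raw_img, directory, image_type, ext):
    """Write raw_img as <image_type>_<n>.<ext>, falling back to .jpg."""
    os.makedirs(directory, exist_ok=True)
    # Same counter as in the question: existing files of this type, plus one.
    cntr = len([name for name in os.listdir(directory) if image_type in name]) + 1
    path = os.path.join(directory, image_type + "_" + str(cntr) + "." + (ext or "jpg"))
    # The with-block closes the file even if the write fails.
    with open(path, 'wb') as f:
        f.write(raw_img)
    return path


# Offline demo: build the request and save placeholder bytes to a temp folder.
req = build_request("https://example.com/image.png", header)
print(req.get_header('User-agent'))
with tempfile.TemporaryDirectory() as tmp:
    print(os.path.basename(save_image(b"not a real image", tmp, "ActiOn", "")))
```

In the real loop, the bytes would come from ur.urlopen(req).read() and the extension from the Type value parsed out of the search results.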
