
I can't download Google Images using Python Selenium

Hi, I'm crawling Google Images using Selenium, but it isn't working. How can I get this code to work? My code is below.

Previously I used google_images_download, but it suddenly stopped working. So I'm looking for a new approach, and I hope someone can help. Thank you.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import json
import os
import urllib.request as urllib2
import argparse

searchterm = 'spider' # will also be the name of the folder
url = "https://www.google.co.in/search?q="+searchterm+"&source=lnms&tbm=isch"
# NEED TO DOWNLOAD CHROMEDRIVER, insert path to chromedriver inside parentheses in following line
browser = webdriver.Chrome(r'C:\Python27\Scripts\chromedriver')  # raw string so the backslashes are not treated as escapes
browser.get(url)
header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
counter = 0
succounter = 0

if not os.path.exists(searchterm):
    os.mkdir(searchterm)

for _ in range(500):
    browser.execute_script("window.scrollBy(0,10000)")

for x in browser.find_elements_by_xpath('//div[contains(@class,"rg_meta")]'):
    counter = counter + 1
    print("Total Count:", counter)
    print("Successful Count:", succounter)
    print("URL:",json.loads(x.get_attribute('innerHTML'))["ou"])

    img = json.loads(x.get_attribute('innerHTML'))["ou"]
    imgtype = json.loads(x.get_attribute('innerHTML'))["ity"]
    try:
        # header is already a dict of HTTP headers, so pass it directly
        req = urllib2.Request(img, headers=header)
        raw_img = urllib2.urlopen(req).read()
        with open(os.path.join(searchterm, searchterm + "_" + str(counter) + "." + imgtype), "wb") as f:
            f.write(raw_img)
        succounter = succounter + 1
    except Exception as e:
        print("can't get img:", e)

print(succounter, "pictures successfully downloaded")
browser.close()

I decided to crawl images from a website other than Google Images.

I also ran into problems crawling images from Google with a method like yours that relies on rg_meta.

The source of the Google image search results page has changed, and it has no longer included rg_meta since the beginning of 2020.

The rg_meta class has also been replaced by a random-looking string such as:

rg_X XXXXXX XXXXXX
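
One way to confirm this on your own machine is to count how many elements in the loaded page still carry the rg_meta class. The sketch below is only a diagnostic, and the helper name count_elements_with_class is made up for illustration. It uses a substring match on the class attribute, mirroring the contains(@class, "rg_meta") XPath in the question.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given substring."""

    def __init__(self, token):
        super().__init__()
        self.token = token
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = dict(attrs).get("class") or ""
        if self.token in classes:
            self.count += 1


def count_elements_with_class(page_source, token="rg_meta"):
    """Hypothetical helper: how many elements in page_source match token."""
    parser = ClassCounter(token)
    parser.feed(page_source)
    return parser.count
```

Calling count_elements_with_class(browser.page_source) after the page has loaded should return 0 if the class is really gone, which would also explain why the for loop in the question never runs its body and no error is printed.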

I think Google has set out to block crawling bots and is steering people toward the Google Custom Search APIs.
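
If you go the API route, a minimal sketch could look like the following. It assumes the Custom Search JSON API endpoint and its key, cx, q, searchType, start and num query parameters, and it assumes you have already created an API key and a search engine ID with image search enabled. The function names are made up for this example, and quotas and per-request limits should be checked against the official documentation.

```python
import json
import urllib.parse
import urllib.request

# Endpoint of the Custom Search JSON API (assumed; verify in the official docs)
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_image_search_url(query, api_key, engine_id, start=1, num=10):
    """Build the request URL for one page of image results."""
    params = {
        "key": api_key,         # your API key (placeholder)
        "cx": engine_id,        # your search engine ID (placeholder)
        "q": query,
        "searchType": "image",  # restrict results to images
        "start": start,         # 1-based index of the first result
        "num": num,             # results per request
    }
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


def extract_image_links(response_body):
    """Pull the image URLs out of a JSON response body."""
    data = json.loads(response_body)
    return [item["link"] for item in data.get("items", [])]


def search_images(query, api_key, engine_id, start=1):
    """Perform one request and return the image URLs it contains."""
    url = build_image_search_url(query, api_key, engine_id, start=start)
    with urllib.request.urlopen(url) as response:
        return extract_image_links(response.read().decode("utf-8"))
```

For example, search_images("spider", "YOUR_API_KEY", "YOUR_ENGINE_ID") would return a list of image URLs that you can then download with urllib.request in the same way as in the question, without depending on the markup of the results page.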
