
How to download google image search results in Python

This question has been asked numerous times before, but all the answers are at least a couple of years old and are based on the ajax.googleapis.com API, which is no longer supported.

Does anyone know of another way? I'm trying to download a hundred or so search results, and in addition to Python APIs I've tried numerous desktop, browser-based, and browser-addon programs for doing this, all of which failed.

Use the Google Custom Search for what you want to achieve. See @i08in's answer to "Python - Download Images from google Image search?"; it has a great description, script samples and library references.

To download any number of images from Google image search using Selenium:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import os
import json
import urllib2
import sys
import time

# adding path to geckodriver to the OS environment variable
# assuming that it is stored at the same path as this script
os.environ["PATH"] += os.pathsep + os.getcwd()
download_path = "dataset/"

def main():
    searchtext = sys.argv[1] # the search query
    num_requested = int(sys.argv[2]) # number of images to download
    number_of_scrolls = num_requested // 400 + 1
    # number_of_scrolls * 400 images will be opened in the browser

    if not os.path.exists(download_path + searchtext.replace(" ", "_")):
        os.makedirs(download_path + searchtext.replace(" ", "_"))

    url = "https://www.google.co.in/search?q="+searchtext+"&source=lnms&tbm=isch"
    driver = webdriver.Firefox()
    driver.get(url)

    headers = {}
    headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
    extensions = {"jpg", "jpeg", "png", "gif"}
    img_count = 0
    downloaded_img_count = 0

    for _ in xrange(number_of_scrolls):
        for __ in xrange(10):
            # multiple scrolls needed to show all 400 images
            driver.execute_script("window.scrollBy(0, 1000000)")
            time.sleep(0.2)
        # to load next 400 images
        time.sleep(0.5)
        try:
            driver.find_element_by_xpath("//input[@value='Show more results']").click()
        except Exception as e:
            print "Less images found:", e
            break

    # images = driver.find_elements_by_xpath('//div[@class="rg_meta"]') # not working anymore
    images = driver.find_elements_by_xpath('//div[contains(@class,"rg_meta")]')
    print "Total images:", len(images), "\n"
    for img in images:
        img_count += 1
        img_url = json.loads(img.get_attribute('innerHTML'))["ou"]
        img_type = json.loads(img.get_attribute('innerHTML'))["ity"]
        print "Downloading image", img_count, ": ", img_url
        try:
            if img_type not in extensions:
                img_type = "jpg"
            req = urllib2.Request(img_url, headers=headers)
            raw_img = urllib2.urlopen(req).read()
            f = open(download_path+searchtext.replace(" ", "_")+"/"+str(downloaded_img_count)+"."+img_type, "wb")
            f.write(raw_img)
            f.close()
            downloaded_img_count += 1
        except Exception as e:
            print "Download failed:", e
        finally:
            print
        if downloaded_img_count >= num_requested:
            break

    print "Total downloaded: ", downloaded_img_count, "/", img_count
    driver.quit()

if __name__ == "__main__":
    main()

Full code is here.

Make sure you install the icrawler library first:

pip install icrawler
from icrawler.builtin import GoogleImageCrawler
google_Crawler = GoogleImageCrawler(storage = {'root_dir': r'write the name of the directory you want to save to here'})
google_Crawler.crawl(keyword = 'sad human faces', max_num = 800)
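
If you need more control, the icrawler documentation also describes optional search filters and thread settings. The snippet below is only a sketch based on those documented options (the root_dir value and keyword are placeholders); check it against the icrawler version you have installed:

from icrawler.builtin import GoogleImageCrawler

# use several download threads and save into a placeholder directory
google_crawler = GoogleImageCrawler(
    downloader_threads=4,
    storage={'root_dir': 'images/cats'})
# filters narrow the search; 'size' and 'license' are keys described in the icrawler docs
filters = dict(size='large', license='commercial,modify')
google_crawler.crawl(keyword='cat', filters=filters, max_num=100)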

How about this one?

https://github.com/hardikvasa/google-images-download

It allows you to download hundreds of images and has a ton of filters to choose from to customize your search.


If you want to download more than 100 images per keyword, you will need to install 'selenium' along with 'chromedriver'.

If you have pip-installed the library or run the setup.py file, Selenium will have been installed automatically on your machine. You will also need the Chrome browser on your machine. For chromedriver:

Download the correct chromedriver based on your operating system.

On Windows or macOS, if for some reason chromedriver gives you trouble, download it into the current directory and run the command.

On Windows, however, the path to chromedriver has to be given in the following format:

C:\\complete\\path\\to\\chromedriver.exe

On Linux, if you are having issues installing the Google Chrome browser, refer to this CentOS or Amazon Linux Guide or Ubuntu Guide.

For all operating systems you will have to use the '--chromedriver' or '-cd' argument to specify the path of the chromedriver that you have downloaded on your machine.
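
For example, a call along the following lines (the keyword, limit and chromedriver path are placeholders to replace with your own values):

googleimagesdownload --keywords "polar bears" --limit 300 --chromedriver /path/to/chromedriver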

Improving a bit on Ravi Hirani's answer, the simplest way is to go by this:

from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(storage={'root_dir': 'D:\\projects\\data core\\helmet detection\\images'})
google_crawler.crawl(keyword='cat', max_num=100)

Source: https://pypi.org/project/icrawler/

I have been using this script to download images from Google search, and I have been using them for training my classifiers. The code below can download 100 images related to the query.

from bs4 import BeautifulSoup
import requests
import re
import urllib2
import os
import cookielib
import json

def get_soup(url,header):
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url,headers=header)),'html.parser')


query = raw_input("query image")# you can change the query for the image  here
image_type="ActiOn"
query= query.split()
query='+'.join(query)
url="https://www.google.co.in/search?q="+query+"&source=lnms&tbm=isch"
print url
#add the directory for your image here
DIR="Pictures"
header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"
}
soup = get_soup(url,header)


ActualImages=[]# contains the link for Large original images, type of  image
for a in soup.find_all("div",{"class":"rg_meta"}):
    link , Type =json.loads(a.text)["ou"]  ,json.loads(a.text)["ity"]
    ActualImages.append((link,Type))

print  "there are total" , len(ActualImages),"images"

if not os.path.exists(DIR):
    os.mkdir(DIR)
DIR = os.path.join(DIR, query.split()[0])

if not os.path.exists(DIR):
    os.mkdir(DIR)
###print images
for i , (img , Type) in enumerate( ActualImages):
    try:
        req = urllib2.Request(img, headers=header)  # header is already a dict of request headers
        raw_img = urllib2.urlopen(req).read()

        cntr = len([i for i in os.listdir(DIR) if image_type in i]) + 1
        print cntr
        if len(Type)==0:
            f = open(os.path.join(DIR , image_type + "_"+ str(cntr)+".jpg"), 'wb')
        else :
            f = open(os.path.join(DIR , image_type + "_"+ str(cntr)+"."+Type), 'wb')


        f.write(raw_img)
        f.close()
    except Exception as e:
        print "could not load : "+img
        print e

I'm trying this library, which can be used as both a command line tool and a Python library. It has lots of arguments to find images with different criteria.

These are examples taken from its documentation, to use it as a Python library:

from google_images_download import google_images_download   #importing the library

response = google_images_download.googleimagesdownload()   #class instantiation

arguments = {"keywords":"Polar bears,baloons,Beaches","limit":20,"print_urls":True}   #creating list of arguments
paths = response.download(arguments)   #passing the arguments to the function
print(paths)   #printing absolute paths of the downloaded images

or as a command line tool, as follows:

$ googleimagesdownload -k "car" -sk 'red,blue,white' -l 10

You can install this with pip install google_images_download.

A simple solution to this problem is to install a Python package called google_images_download:

pip install google_images_download

Use this Python code:

from google_images_download import google_images_download  

response = google_images_download.googleimagesdownload()
keywords = "apple fruit"
arguments = {"keywords":keywords,"limit":20,"print_urls":True}
paths = response.download(arguments)
print(paths)

Adjust the limit to control the number of images to download.

Some images won't open because they might be corrupt; a quick way to weed those out is sketched below.

Change the keywords string to get the output you need.
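
As a follow-up to the note about corrupt files, here is a minimal sketch for weeding them out after the download, assuming Pillow is installed and using a placeholder downloads/ directory:

import os
from PIL import Image  # pip install Pillow

download_dir = "downloads"  # placeholder: the folder the images were saved into
for name in os.listdir(download_dir):
    path = os.path.join(download_dir, name)
    try:
        with Image.open(path) as img:
            img.verify()  # raises an exception if the file is not a readable image
    except Exception:
        print("Removing corrupt file:", path)
        os.remove(path)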

You need to use the Custom Search API. There is a handy explorer here. I use urllib2. You also need to create an API key for your application from the developer console.
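
For reference, a minimal sketch of an image query against the Custom Search JSON API (YOUR_API_KEY and YOUR_CX are placeholders for your own API key and search engine ID; it uses requests for brevity, but the same endpoint and parameters work with urllib2):

import requests

API_KEY = "YOUR_API_KEY"  # placeholder: key created in the developer console
CX = "YOUR_CX"            # placeholder: Custom Search Engine ID

params = {
    "key": API_KEY,
    "cx": CX,
    "q": "polar bears",
    "searchType": "image",  # restrict results to images
    "num": 10,              # the API returns at most 10 results per request
}
resp = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item["link"])   # direct URL of each image result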

To get the best out of googleimagesdownload, use pip3 install to obtain it and then use the following wrapper to turn it into an API. Basically, the defaults in the code say to download 10 large images that are marked 'labled-for-reuse' (misspelt by the original authors). If I don't pass an argument such as -k="yellow pepper", it will download 10 red pepper images by default. You can change the default arguments in the googleImageDownloader dictionary I have provided to whatever you like, as long as they conform to the developer's google_images_download.py.

#!/usr/bin/env python3

import sys
import subprocess
import re

def main( arguments ):
  googleImageDownloader = {'s':'large', 'l':'10', 'r':'labled-for-reuse', 'k':'red pepper'}
  for argvitem in arguments[1:]:
    argumentName = re.sub( r'^-(.*)', r'\1', argvitem )
    argumentName = re.sub( r'^-(.*)', r'\1', argumentName )
    argumentName = re.sub( r'(.*)=(.*)', r'\1', argumentName )
    value        = re.sub( r'(.*)=(.*)', r'\2', argvitem )

    googleImageDownloader[argumentName] = value

  callingString = "googleimagesdownload"
  for key, value in googleImageDownloader.items():
    if " " in value:
      value = "\"" + value + "\""

    callingString+= " -" + key + " " + value

  print( callingString )
  statusAndOutputText = subprocess.getstatusoutput( callingString )
  print( statusAndOutputText[1] )

if "__main__" == __name__:
  main( sys.argv )

So I just run the above imagedownload.py, passing any argument with -- or -:

$ python ./imagedownload.py -k="yellow pepper"

to obtain the following result:

googleimagesdownload -s large -l 10 -k "yellow pepper" -r labeled-for-reuse

Item no.: 1 --> Item name = yellow pepper
Evaluating...
Starting Download...
Completed Image ====> 1. paprika-vegetables-yellow-red-53008.jpe
Completed Image ====> 2. plant-fruit-orange-food-pepper-produce-vegetable-yellow-peppers-bell-pepper-flowering-plant-yellow-pepper-land-plant-bell-peppers-and-chili-peppers-pimiento-habanero-chili-137913.jpg
Completed Image ====> 3. yellow-bell-pepper.jpg
Completed Image ====> 4. yellow_bell_pepper_group_store.jpg
Completed Image ====> 5. plant-fruit-food-produce-vegetable-yellow-peppers-bell-pepper-persimmon-diospyros-flowering-plant-sweet-pepper-yellow-pepper-land-plant-bell-peppers-and-chili-peppers-pimiento-habanero-chili-958689.jpg
Completed Image ====> 6. 2017-06-28-10-23-21.jpg
Completed Image ====> 7. yellow_bell_pepper_2017_a3.jpg
Completed Image ====> 8. 2017-06-26-12-06-35.jpg
Completed Image ====> 9. yellow-bell-pepper-1312593087h9f.jpg
Completed Image ====> 10. plant-fruit-food-pepper-produce-vegetable-macro-yellow-background-vegetables-peppers-bell-pepper-vitamins-flowering-plant-chili-pepper-annex-yellow-pepper-land-plant-bell-peppers-and-chili-peppers-pimiento-habanero-chili-1358020.jpg

Everything downloaded!
Total Errors: 0

I have tried many codes but none of them worked for me. I am posting my working code here. Hope it will help others.

I am using Python version 3.6 and used icrawler.

First, you need to install icrawler on your system.

Then run the code below.

from icrawler.examples import GoogleImageCrawler
google_crawler = GoogleImageCrawler()
google_crawler.crawl(keyword='krishna', max_num=100)

Replace the keyword krishna with your desired text.

Note: the downloaded images need a path. Right now I used the same directory where the script is placed. You can set a custom directory via the code below.

google_crawler = GoogleImageCrawler('path_to_your_folder')

If someone still needs this, I just posted a simple Python project to download all Google images from a given query with no limit.

It's available at https://github.com/misterymachine/google-image-downloader

Have a great day, hope you'll enjoy this project!
