
Crawl images from Google search with Python

I am trying to write a Python script to crawl images from Google search. I want to collect the URLs of the images and then save those images to my computer. I found some code that does this, but it only retrieves 60 URLs; after that a timeout message appears. Is it possible to retrieve more than 60 images? My code:

import json
import os
import time
import urllib.request

import requests


def crawl_images(query, path):

    BASE_URL = ('https://ajax.googleapis.com/ajax/services/search/images?'
                'v=1.0&q=' + query + '&start=%d')

    BASE_PATH = os.path.join(path, query)

    if not os.path.exists(BASE_PATH):
        os.makedirs(BASE_PATH)

    counter = 1
    urls = []
    start = 0  # Google's start query string parameter for pagination.
    while start < 60:  # Google will only return a limited number of results.
        r = requests.get(BASE_URL % start)
        for image_info in json.loads(r.text)['responseData']['results']:
            url = image_info['unescapedUrl']
            print(url)
            urls.append(url)

            try:
                # Save into the directory created above rather than a
                # hard-coded folder name.
                urllib.request.urlretrieve(
                    url, os.path.join(BASE_PATH, 'image_%d.jpg' % counter))
                counter += 1
            except IOError:
                # Throw away some gifs...blegh.
                print('could not save %s' % url)
                continue

        print(start)
        start += 4  # 4 images per page.
        time.sleep(1.5)

crawl_images('model runway', '')

Have a look at the documentation: https://developers.google.com/image-search/v1/jsondevguide

You should get up to 64 results:

Note: The Image Searcher supports a maximum of 8 result pages. When combined with subsequent requests, a maximum total of 64 results are available. It is not possible to request more than 64 results.
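Concretely, that limit means the most you can fetch is 8 pages of 8 results each. A minimal sketch of building those page URLs, assuming the API's `rsz` parameter (results per page, up to 8) alongside `start`; the service has since been shut down, so this is illustrative only:

```python
# Build the 8 page URLs the (deprecated) Image Search API allowed:
# rsz=8 results per page, start offsets 0, 8, ..., 56 -> 64 results total.
BASE_URL = ('https://ajax.googleapis.com/ajax/services/search/images?'
            'v=1.0&q=%s&rsz=8&start=%d')

def page_urls(query):
    """Return the full set of page URLs the API would accept for a query."""
    return [BASE_URL % (query, start) for start in range(0, 64, 8)]

urls = page_urls('model+runway')
print(len(urls))   # 8 pages
print(urls[-1])    # last page starts at offset 56
```

Requesting `start=64` or higher would just return an error, so there is no point looping further than this.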

Another note: you can restrict the file type, so you don't need to ignore GIFs and the like.
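For example, the API accepted an `as_filetype` parameter restricting results to a single file type (to the best of my recollection of the deprecated parameter name; the service is gone, so treat this as a sketch):

```python
from urllib.parse import urlencode

# as_filetype restricts results to one file type (jpg, png, gif, bmp),
# so unwanted GIFs never appear in the response at all.
params = {'v': '1.0', 'q': 'model runway', 'as_filetype': 'jpg'}
url = ('https://ajax.googleapis.com/ajax/services/search/images?'
       + urlencode(params))
print(url)
```

With that in place, the `except IOError` branch in the question's code no longer needs to filter out non-JPEG downloads.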


As an additional note, please keep in mind that this API should only be used for user-driven searches, not automated ones:

Note: The Google Image Search API must be used for user-generated searches. Automated or batched queries of any kind are strictly prohibited.

You can try the icrawler package. It is extremely easy to use, and I've never had problems with the number of images to be downloaded.
