Using Scrapy to download Google images from multiple URLs

Question

I am trying to download images from multiple URLs, each of which is a Google Images search.

However, I only want 15 images from each URL.

class imageSpider(BaseSpider):
    name = "image"
    start_urls = [
        'https://google.com/search?q=simpsons&tbm=isch'
        'https://google.com/search?q=futurama&tbm=isch'
        ]


    def parse(self,response):
        hxs = HtmlXPathSelector(response)
        items = []
        images = hxs.select("//div[@id='ires']//div//a[@href]")
        count = 0
        for image in images:
            count += 1
            item = ImageItem()
            image_url = image.select(".//img[@src]")[0].extract()
            import urlparse
            image_absolute_url = urlparse.urljoin(response.url, image_url.strip())
            index = image_absolute_url.index("src")
            changedUrl = image_absolute_url[index+5:len(image_absolute_url)-2]
            item['image_urls'] = [changedUrl]
            index1 = site['url'].index("search?q=")
            index2 = site['url'].index("&tbm=isch")
            imageName = site['url'][index1+9:index2]
            download(changedUrl,imageName + str(count)+".png")
            items.append(item)
            if count == 15:
                break
        return items

The download function downloads the images (I already have code for that; it is not the problem).

The problem is that when I break, the spider stops after the first URL and never continues to the next one. How can I make it download 15 images for the first URL and then 15 images for the second URL? I am using break because there are about 1000 images on every Google Images page and I don't want that many.

Answer 1

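The problem is not the break statement. You are missing a comma in start_urls, so Python concatenates the two adjacent string literals into one string and the spider is given only a single start URL.

It should look like this: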

start_urls = [
    'http://google.com/search?q=simpsons&tbm=isch',
    'http://google.com/search?q=futurama&tbm=isch'
]
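
To see why the missing comma matters, here is a minimal sketch in plain Python, with no Scrapy involved. The names broken_start_urls and fixed_start_urls are made up for illustration; the URLs are the ones from the question.

```python
# Adjacent string literals with no comma between them are joined
# into one string by Python, so this list has a single element.
broken_start_urls = [
    'https://google.com/search?q=simpsons&tbm=isch'
    'https://google.com/search?q=futurama&tbm=isch'
]

# With the comma, the list holds two separate URLs.
fixed_start_urls = [
    'https://google.com/search?q=simpsons&tbm=isch',
    'https://google.com/search?q=futurama&tbm=isch',
]

print(len(broken_start_urls))  # 1
print(len(fixed_start_urls))   # 2
```

With only one entry in the list, parse is called for a single response, which is why the spider appears to stop after the first URL regardless of the break.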

solution1
1 ACCEPTED 2012-06-22 00:59:23
