
Using Scrapy to download Google images from multiple URLs

I am trying to download images from multiple URLs, each of which is a Google Images search.

However, I only want 15 images from each URL.

class imageSpider(BaseSpider):
    name = "image"
    start_urls = [
        'https://google.com/search?q=simpsons&tbm=isch'
        'https://google.com/search?q=futurama&tbm=isch'
        ]


def parse(self,response):
    hxs = HtmlXPathSelector(response)
    items = []
    images = hxs.select("//div[@id='ires']//div//a[@href]")
    count = 0
    for image in images:
        count += 1
        item = ImageItem()
        image_url = image.select(".//img[@src]")[0].extract()
        import urlparse
        image_absolute_url = urlparse.urljoin(response.url, image_url.strip())
        index = image_absolute_url.index("src")
        changedUrl = image_absolute_url[index+5:len(image_absolute_url)-2]
        item['image_urls'] = [changedUrl]
        index1 = site['url'].index("search?q=")
        index2 = site['url'].index("&tbm=isch")
        imageName = site['url'][index1+9:index2]
        download(changedUrl,imageName + str(count)+".png")
        items.append(item)
        if count == 15:
            break
    return items

The download function downloads the images (I already have code for that; it is not the problem).

The problem is that when I break, it stops at the first URL and never continues on to the next one. How can I make it download 15 images for the first URL and then 15 images for the second URL? I am using break because there are about 1000 images on every Google Images page and I don't want that many.

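The problem is not the break statement. You are missing a comma in start_urls. Python joins adjacent string literals into one string, so the list holds a single concatenated URL instead of two, and the spider only ever requests that one.

It should be like this: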

start_urls = [
    'http://google.com/search?q=simpsons&tbm=isch',
    'http://google.com/search?q=futurama&tbm=isch'
]
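
To see why the missing comma matters, here is a minimal sketch in plain Python, independent of Scrapy. The variable names missing_comma and with_comma are made up for illustration; the URLs are the ones from the question.

```python
# Adjacent string literals with no comma between them are joined
# into a single string by Python before the list is built.
missing_comma = [
    'https://google.com/search?q=simpsons&tbm=isch'
    'https://google.com/search?q=futurama&tbm=isch'
]

# With the comma, the list holds two separate URLs.
with_comma = [
    'https://google.com/search?q=simpsons&tbm=isch',
    'https://google.com/search?q=futurama&tbm=isch',
]

print(len(missing_comma))  # 1
print(len(with_comma))     # 2
```

With only one entry in start_urls, parse is called once, so the loop appears to stop after the first 15 images regardless of the break.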
