
Why is Scrapy skipping over links?

I'm a newbie to Python and I'm trying out Scrapy against Amazon. I'm trying to get the item names and counts from

https://www.amazon.ca/gp/search/other/ref=sr_in_a_C?rh=i%3Akitchen%2Cn%3A2206275011%2Cn%3A%212206276011%2Cn%3A2224068011%2Cn%3A6647367011%2Cn%3A6647368011&page=2&bbn=6647368011&pickerToList=lbr_brands_browse-bin&indexField=a&ie=UTF8&qid=1515439041

Here's my Python code:

import scrapy

class ToScrapeCSSSpider(scrapy.Spider):
    name = "toscrapeamazon-css"
    start_urls = [
        "https://www.amazon.ca/gp/search/other/ref=sr_in_a_-2?rh=i%3Akitchen%2Cn%3A2206275011%2Cn%3A%212206276011%2Cn%3A2224068011%2Cn%3A6647367011%2Cn%3A6647368011&page=2&bbn=6647368011&pickerToList=lbr_brands_browse-bin&indexField=a&ie=UTF8&qid=1515436664",
    ]

    def parse(self, response):
        for item in response.css("span.a-list-item"):
            yield {
                "item_name": item.css("span.refinementLink::text").extract_first(),
                "item_cnt": item.css("span.narrowValue::text").extract_first()
            }

        next_page_url = response.css("span.pagnLink > a::attr(href)").extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

I'm able to get most of the data I want, but I'm not getting anything for the letters D, E, I, and J. Any idea what I'm doing wrong?

I tried your code; the fact that it ran in a few seconds and finished with this log message:

Filtered duplicate request: <GET https://www.amazon.ca/gp/search/other?ie=UTF8&page=2&pickerToList=lbr_brands_browse-bin&rh=n%3A6647368011>

led me to look at the letter links. It turns out you're not getting what you think you are. Look closely at the URLs for the letter links at the top: they're all the same. They each point to the "top brands" page, which is what you're actually scraping. It just happens that there are no "top brands" that begin with D, E, I, or J (or Q, Y, or Z). No letter-specific links exist in the HTML, so there must be a JavaScript listener on the letter links that intercepts the click and redirects you to the letter-specific URL, which looks like this:

https://www.amazon.ca/gp/search/other/ref=sr_in_e_A?rh=i%3Akitchen%2Cn%3A6647368011&page=2&pickerToList=lbr_brands_browse-bin&indexField=e&ie=UTF8&qid=1516249484

If you want to scrape those pages, you're going to have to generate the URLs yourself and feed them to Scrapy. Fortunately it's pretty easy: you just need to replace the e in indexField=e with each of the other letters.
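A minimal sketch of that URL generation, using only the standard library. The names TEMPLATE_URL and letter_urls are made up for illustration, and it assumes the only part that has to change is the indexField value (the ref= segment and the qid are left exactly as in the example URL above, which may or may not matter to Amazon):

```python
import string

# Letter-specific URL copied from the answer above; "indexField=e" selects the letter.
# Assumption: the rest of the query string can stay the same for every letter.
TEMPLATE_URL = (
    "https://www.amazon.ca/gp/search/other/ref=sr_in_e_A"
    "?rh=i%3Akitchen%2Cn%3A6647368011&page=2"
    "&pickerToList=lbr_brands_browse-bin&indexField=e&ie=UTF8&qid=1516249484"
)


def letter_urls(template=TEMPLATE_URL, letters=string.ascii_lowercase):
    """Return one URL per letter by swapping the indexField value (hypothetical helper)."""
    return [
        template.replace("indexField=e", "indexField=" + letter)
        for letter in letters
    ]
```

In the spider you could then write start_urls = letter_urls() instead of hard-coding a single URL, so that each letter page goes through parse().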

Handle failed responses as well: if the response status is one of [404, 403, 302, 503, 502, 400, 407], make the same request again, as follows.

    # Re-issue the same request; dont_filter=True bypasses the duplicate filter.
    if response.status in [404, 403, 302, 503, 502, 400, 407]:
        yield scrapy.Request(url=response.request.url, callback=self.parse, dont_filter=True)

If you're using concurrent requests, make sure you have enough proxies.
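One thing to watch with the retry above is that it has no upper bound, so a URL that always fails would be requested forever. Here is a small sketch of a bounded check. RETRY_STATUSES reuses the status list from this answer, while MAX_RETRIES and should_retry are hypothetical names and the limit of 3 is an arbitrary assumption:

```python
# Status codes from the answer above that should trigger another attempt.
RETRY_STATUSES = {404, 403, 302, 503, 502, 400, 407}

# Hypothetical limit; tune it to how flaky the proxies are.
MAX_RETRIES = 3


def should_retry(status, retries_so_far, max_retries=MAX_RETRIES):
    """Return True if this status is retryable and the retry budget is not used up."""
    return status in RETRY_STATUSES and retries_so_far < max_retries
```

Inside parse() the attempt count could be carried along in the request's meta dict and incremented on each retry. Also note that Scrapy's default middlewares normally keep non-2xx responses away from the callback, so the spider would need to list these codes in its handle_httpstatus_list attribute for the status check to ever see them.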


