
Scrapy not scraping all pages

I'm new to Scrapy and have been trying to develop a spider that scrapes Tripadvisor's "Things to Do" pages. Tripadvisor paginates results with an offset parameter, so I have the spider find the last page number, multiply it by the 30 results per page, and loop over that range with a step of 30. However, it returns only a fraction of the results it's supposed to, and get_details prints out only 7 of the 28 pages scraped. I believe what is happening is URL redirection on random pages.

Scrapy logs this 301 redirect for the other pages, and it appears to redirect back to the first page. I tried disabling redirection, but that did not work.

2021-03-28 18:46:38 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.tripadvisor.com/Attractions-g55229-Activities-a_allAttractions.true-Nashville_Davidson_County_Tennessee.html> from <GET https://www.tripadvisor.com/Attractions-g55229-Activities-a_allAttractions.true-oa90-Nashville_Davidson_County_Tennessee.html>
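For context, disabling redirection in Scrapy is normally done either globally with the REDIRECT_ENABLED setting or per request with the dont_redirect meta key (a sketch of the standard approaches, presumably what "disabling redirection" refers to):

```python
# Option 1 -- globally, in settings.py:
# REDIRECT_ENABLED = False

# Option 2 -- per request, via request meta. The 301/302 statuses must
# also be whitelisted so the response reaches the callback instead of
# being filtered out by HttpErrorMiddleware:
meta = {
    'dont_redirect': True,
    'handle_httpstatus_list': [301, 302],
}
# yield scrapy.Request(url, callback=self.get_details, meta=meta)
```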

Here's my code for the spider:

import scrapy
import re


class TripadvisorSpider(scrapy.Spider):
    name = "tripadvisor"

    start_urls = [
        'https://www.tripadvisor.com/Attractions-g55229-Activities-a_allAttractions.true-oa{}-Nashville_Davidson_County_Tennessee.html'
    ]

    def parse(self, response):
        # Read the highest page number from the pagination bar, then
        # request every page by its 30-result offset.
        num_pages = int(response.css(
            '._37Nr884k .DrjyGw-P.IT-ONkaj::text')[-1].get())

        for offset in range(0, num_pages * 30, 30):
            formatted_url = self.start_urls[0].format(offset)
            yield scrapy.Request(formatted_url, callback=self.get_details)

    def get_details(self, response):
        print('url is ' + response.url)
        # Each listing card on the page becomes one scraped item.
        for listing in response.css('div._19L437XW._1qhi5DVB.CO7bjfl5'):
            yield {
                'title': listing.css('._392swiRT ._1gpq3zsA._1zP41Z7X::text')[1].get(),
                'category': listing.css('._392swiRT ._1fV2VpKV .DrjyGw-P._26S7gyB4._3SccQt-T::text').get(),
                'rating':  float(re.findall(r"[-+]?\d*\.\d+|\d+", listing.css('svg.zWXXYhVR::attr(title)').get())[0]),
                'rating_count': float(listing.css('._392swiRT .DrjyGw-P._26S7gyB4._14_buatE._1dimhEoy::text').get().replace(',', '')),
                'url': listing.css('._3W_31Rvp._1nUIPWja._17LAEUXp._2b3s5IMB a::attr(href)').get(),
                'main_image': listing.css('._1BR0J4XD').attrib['src']
            }
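The pagination arithmetic in parse can be sanity-checked on its own; a minimal sketch, assuming 3 pages of 30 results each:

```python
# Standalone check of the offset logic used in parse() -- num_pages is a
# stand-in for the value scraped from the pagination bar.
URL_TEMPLATE = (
    'https://www.tripadvisor.com/Attractions-g55229-Activities-'
    'a_allAttractions.true-oa{}-Nashville_Davidson_County_Tennessee.html'
)

num_pages = 3
offsets = list(range(0, num_pages * 30, 30))
urls = [URL_TEMPLATE.format(offset) for offset in offsets]
# offsets is [0, 30, 60]; each URL carries the matching oa<offset> segment
```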

Is there a way to get Scrapy working for each page? What exactly is causing this problem?

Found a solution: I needed to handle the redirects manually with a custom retry middleware and disable Scrapy's default redirect middleware.

Here is the custom middleware I added to middlewares.py:

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.selector import Selector
from scrapy.utils.response import get_meta_refresh

class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        url = response.url
        if response.status in [301, 307]:
            reason = 'redirect %d' % response.status
            return self._retry(request, reason, spider) or response
        interval, redirect_url = get_meta_refresh(response)
        # handle meta redirect
        if redirect_url:
            reason = 'meta'
            return self._retry(request, reason, spider) or response
        hxs = Selector(response)
        # test for captcha page
        captcha = hxs.xpath(
            ".//input[contains(@id, 'captchacharacters')]").extract()
        if captcha:
            reason = 'captcha'
            return self._retry(request, reason, spider) or response
        return response
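For this middleware to run, it has to be registered in settings.py, with Scrapy's built-in redirect and retry middlewares disabled so that 301 responses reach the custom handler instead of being followed silently. A sketch, where tripadvisor_scraper is a placeholder for the actual project module name:

```python
# settings.py -- "tripadvisor_scraper" is an assumed project module name.
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-ins so 301s are not followed before we see them.
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    # Install the custom middleware in the slot (550) that the default
    # RetryMiddleware normally occupies.
    'tripadvisor_scraper.middlewares.CustomRetryMiddleware': 550,
}

# RetryMiddleware._retry honours the standard retry budget setting.
RETRY_TIMES = 5
```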

It is an updated version of the top answer to this question: Scrapy retry or redirect middleware
