
Scrapy: Issues with scraping multiple pages

I'm trying to build a spider with Scrapy that returns data from multiple pages. Scraping the first page works, but I can't get any further. This is my code so far:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class AutoscoutSpider(scrapy.Spider):
    name = 'autoscout'
    allowed_domains = ['www.autoscout24.de']
    start_urls = ['https://www.autoscout24.de/ergebnisse?mmvmk0=29&mmvco=1&cy=D&powertype=kw&atype=C&ustate=N%2CU&sort=standard&desc=0']


    def parse(self, response):
        car_name = response.css(".cldt-summary-makemodel::text").extract()
        car_functions = response.css(".cldt-summary-subheadline.sc-font-m.sc-ellipsis::text").extract()
        car_price = response.css(".cldt-price.sc-font-xl.sc-font-bold::text").extract()
        filtered_car_price = filter(lambda x: x not in '\n\n€,-\n', car_price)

        for item in zip(car_name, filtered_car_price, car_functions):
            zipped_info = {
                'name': item[0],
                'price': item[1],
                'description': item[2],
            }

            yield zipped_info

I tried using a LinkExtractor to grab the URLs of the following pages:

rules = (Rule(LinkExtractor(allow=(), restrict_css=('.next-page',)),
         callback="parse_item", follow=True))

I also renamed the parse function to parse_item to avoid overriding Scrapy's built-in parse method. I think I'm missing something in the restrict_css argument, but I'm not sure what it is.

Looking at the page source, you can see that the navigation links aren't defined in the HTML, so the LinkExtractor has nothing to match. Instead, there is a template that is later populated by JavaScript:

<div class="cl-pagination">
<ul class="sc-pagination" data-previous-text="Zurück" data-next-text="Weiter" data-page-size="20" data-current-page="1" data-total-items="86141" data-page-template="/ergebnisse?powertype=kw&amp;pricetype=public&amp;cy=D&amp;mmvmk0=29&amp;mmvco=1&amp;zipr=1000&amp;sort=standard&amp;ustate=N&amp;ustate=U&amp;atype=C&amp;page={page}&amp;size={size}"></ul>
</div>
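
As a rough illustration of how that template could be turned into real URLs, here is a minimal sketch that uses only the Python standard library. The names PaginationTemplateParser and page_url are hypothetical helpers made up for this example, and the default size of 20 is simply the value of data-page-size shown above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PaginationTemplateParser(HTMLParser):
    """Collect the attributes of the first <ul class="sc-pagination"> tag."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.attrs is None and tag == "ul" and "sc-pagination" in classes:
            # HTMLParser has already unescaped &amp; to & in attribute values.
            self.attrs = attr_map


def page_url(html, base_url, page, size=20):
    """Fill the {page} and {size} placeholders of data-page-template.

    Returns an absolute URL, or None if the template is not in the HTML.
    """
    parser = PaginationTemplateParser()
    parser.feed(html)
    if parser.attrs is None:
        return None
    template = parser.attrs.get("data-page-template")
    if not template:
        return None
    relative = template.replace("{page}", str(page)).replace("{size}", str(size))
    return urljoin(base_url, relative)
```

Inside a Scrapy callback the same attribute could presumably be read with response.css("ul.sc-pagination::attr(data-page-template)").get() and made absolute with response.urljoin, which avoids the separate parser.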

From some simple testing, it seems that adding a page parameter to the search URL is enough to reach a different page of the listing.
However, both size and page appear to be capped at 20, so a single search is limited to 400 results.
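
One way to act on this is to compute the next page's URL from the current one and stop at the observed limit. The sketch below is only an illustration: next_page_url and MAX_PAGE are hypothetical names, and the cap of 20 comes from the testing described above, not from any documented guarantee.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

MAX_PAGE = 20  # assumption: the cap observed in testing, not an official figure


def next_page_url(url, max_page=MAX_PAGE):
    """Return url with its page parameter incremented, or None past the cap.

    A URL without a page parameter is treated as page 1.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)

    current = 1
    for key, value in query:
        if key == "page" and value.isdigit():
            current = int(value)

    if current >= max_page:
        return None

    # Drop any existing page parameter and append the incremented one.
    query = [(key, value) for key, value in query if key != "page"]
    query.append(("page", str(current + 1)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

In the spider, parse could call next_page_url(response.url) after yielding its items and, if the result is not None, yield scrapy.Request(next_url, callback=self.parse) to continue with the following page.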
