
Scraping all the links from a website using scrapy not working

I tried to scrape all the links from a website where the results are paginated. Below is my Scrapy code, but it is not working: it only scrapes the links from the first page. How do I scrape all the links? Thanks

# -*- coding: utf-8 -*-
import scrapy


class DummySpider(scrapy.Spider):
    name = 'dummyspider'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/countrysearch/CN/China/products/A.html']

    def parse(self, response):
        link = response.xpath('//*[@class="column one3"]/a/@href').extract()

        for href in link:
            yield {'link': href}
        next_page_url = response.xpath('//*[@class="page_btn"]/@href').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url = next_page_url, callback = self.parse)

The starting URL is https://www.alibaba.com/countrysearch/CN/China/products/A.html

You can solve this by setting up your start URLs properly.

The string module has alphabet constants:

>>> import string
>>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

You can use it to create your URLs programmatically:

import string
from scrapy import Spider

class MySpider(Spider):
    name = 'alibaba'
    # One start URL per letter of the alphabet.
    start_urls = [
        f'http://foo.com/?letter={char}'
        for char in string.ascii_uppercase
    ]
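
Applied to the question's site, here is a sketch that combines the generated start URLs with the parse logic from the question, so every letter's listing is scraped and its pagination followed. The URL pattern is inferred from the question's start URL, and the XPath selectors are copied from the question as-is, so they may need to be verified against the live markup:

import string

import scrapy


class AlphabetSpider(scrapy.Spider):
    name = 'alibaba_letters'
    allowed_domains = ['alibaba.com']
    # One listing page per letter: .../products/A.html through .../products/Z.html.
    start_urls = [
        f'https://www.alibaba.com/countrysearch/CN/China/products/{char}.html'
        for char in string.ascii_uppercase
    ]

    def parse(self, response):
        # Extract the product links on the current page (selector from the question).
        for href in response.xpath('//*[@class="column one3"]/a/@href').extract():
            yield {'link': href}

        # Follow the next-page link within this letter's listing, if any
        # (selector from the question; adjust if it does not match the page).
        next_page_url = response.xpath('//*[@class="page_btn"]/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)

Each of the 26 start URLs is crawled independently, and the pagination requests keep the crawl inside each letter's listing, so no links should be missed as long as the selectors match.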
