
Scraping all the links from a website using scrapy not working

I tried to scrape all the links from a website where the results are paginated. Below is my Scrapy code, but it is not working: it only scrapes the links from the first page. How do I scrape all the links? Thanks

# -*- coding: utf-8 -*-
import scrapy


class DummySpider(scrapy.Spider):
    name = 'dummyspider'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/countrysearch/CN/China/products/A.html']

    def parse(self, response):
        link = response.xpath('//*[@class="column one3"]/a/@href').extract()

        for href in link:
            yield {'link': href}
        next_page_url = response.xpath('//*[@class="page_btn"]/@href').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url = next_page_url, callback = self.parse)

The starting URL is https://www.alibaba.com/countrysearch/CN/China/products/A.html

You can solve this by setting up your start URLs properly.

The string module has alphabet constants:

>>> import string
>>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

You can use it to create your URLs programmatically:

import string
from scrapy import Spider

class MySpider(Spider):
    name = 'alibaba'
    # One start URL per letter of the alphabet.
    start_urls = [
        f'http://foo.com/?letter={char}'
        for char in string.ascii_uppercase
    ]
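
Applied to the question's site, here is a sketch that combines the generated start URLs with the parse logic from the question, so every letter's listing is scraped and its pagination followed. The URL pattern is inferred from the question's start URL, and the XPath selectors are copied from the question as-is, so they may need to be verified against the live markup:

import string

import scrapy


class AlphabetSpider(scrapy.Spider):
    name = 'alibaba_letters'
    allowed_domains = ['alibaba.com']
    # One listing page per letter: .../products/A.html through .../products/Z.html.
    start_urls = [
        f'https://www.alibaba.com/countrysearch/CN/China/products/{char}.html'
        for char in string.ascii_uppercase
    ]

    def parse(self, response):
        # Extract the product links on the current page (selector from the question).
        for href in response.xpath('//*[@class="column one3"]/a/@href').extract():
            yield {'link': href}

        # Follow the next-page link within this letter's listing, if any
        # (selector from the question; adjust if it does not match the page).
        next_page_url = response.xpath('//*[@class="page_btn"]/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)

Each of the 26 start URLs is crawled independently, and the pagination requests keep the crawl inside each letter's listing, so no links should be missed as long as the selectors match.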
