I am trying to scrape all the links from a website whose listings are paginated. Below is my Scrapy code, but it is not working: it only scrapes the links from the first page. How do I scrape all the links? Thanks
# -*- coding: utf-8 -*-
import scrapy


class DummySpider(scrapy.Spider):
    name = 'dummyspider'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/countrysearch/CN/China/products/A.html']

    def parse(self, response):
        link = response.xpath('//*[@class="column one3"]/a/@href').extract()
        for item in zip(link):
            scraped_info = {
                'link': item[0],
            }
            yield scraped_info

        next_page_url = response.xpath('//*[@class="page_btn"]/@href').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
The starting URL is https://www.alibaba.com/countrysearch/CN/China/products/A.html
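As a side note, response.urljoin in the spider above resolves the extracted href against the response's own URL, with the same semantics as the standard-library urljoin. A quick sketch of how a relative href from this starting URL would resolve:

```python
from urllib.parse import urljoin

# Base URL taken from the question; the relative href 'B.html' is a
# hypothetical value standing in for whatever @href the XPath extracts.
base = 'https://www.alibaba.com/countrysearch/CN/China/products/A.html'

print(urljoin(base, 'B.html'))
# -> https://www.alibaba.com/countrysearch/CN/China/products/B.html

# An already-absolute href passes through unchanged
print(urljoin(base, 'https://www.alibaba.com/page2.html'))
# -> https://www.alibaba.com/page2.html
```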
You can solve this by setting up your start URLs properly. The string module has alphabet constants:

>>> import string
>>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

You can use it to create your URLs programmatically:
import string

from scrapy import Spider


class MySpider(Spider):
    name = 'alibaba'
    start_urls = [
        f'http://foo.com?letter={char}'
        for char in string.ascii_uppercase
    ]
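Applied to this site, that would mean one start URL per letter. The list comprehension itself can be checked with plain Python (assuming, as the question's starting URL suggests, that every letter page follows the same .../products/<letter>.html pattern; that pattern is an assumption, not verified against the site):

```python
import string

# Assumes each letter page mirrors the A.html URL given in the question
start_urls = [
    f'https://www.alibaba.com/countrysearch/CN/China/products/{char}.html'
    for char in string.ascii_uppercase
]

print(len(start_urls))   # 26 URLs, one per uppercase letter
print(start_urls[0])     # .../products/A.html
print(start_urls[-1])    # .../products/Z.html
```

Scrapy will then schedule each of these URLs independently, and the existing next_page_url logic in parse() handles the pagination within each letter.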