Unable to scrape all links on a website using Scrapy

Question

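I am trying to scrape all the links on a website that is also paginated. My Scrapy code is given below, but it does not work correctly: it only scrapes the links from the first page. How can I scrape all the links? Thanks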

# -*- coding: utf-8 -*-
import scrapy


class DummySpider(scrapy.Spider):
    name = 'dummyspider'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/countrysearch/CN/China/products/A.html']

    def parse(self, response):
        # Collect the links listed on the current page.
        links = response.xpath('//*[@class="column one3"]/a/@href').extract()

        for link in links:
            yield {'link': link}

        # Follow the pagination link, if one is found.
        next_page_url = response.xpath('//*[@class="page_btn"]/@href').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

The start URL is https://www.alibaba.com/countrysearch/CN/China/products/A.html

Answer 1

You can solve this by setting the start URLs correctly.

The string module has constants for the alphabet:

>>> import string
>>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

You can use it to create the URLs programmatically:

import string
from scrapy import Spider  

class MySpider(Spider):
    name = 'alibaba'
    start_urls = [
        f'http://foo.com?letter={char}' 
        for char in string.ascii_uppercase
    ]
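
The same idea can be applied to the URL from the question. This is only a sketch: it assumes every letter page follows the same pattern as the A page (`.../products/A.html`), which the question does not confirm, and `URL_TEMPLATE` and `build_start_urls` are hypothetical names introduced for illustration.

```python
import string

# Assumption: the B..Z index pages use the same URL pattern as the A page
# given in the question. Only the A page is confirmed there.
URL_TEMPLATE = 'https://www.alibaba.com/countrysearch/CN/China/products/{letter}.html'


def build_start_urls(template=URL_TEMPLATE):
    """Return one index URL per uppercase ASCII letter."""
    return [template.format(letter=char) for char in string.ascii_uppercase]


start_urls = build_start_urls()
print(len(start_urls))   # 26
print(start_urls[0])     # https://www.alibaba.com/countrysearch/CN/China/products/A.html
```

In a spider, the list would be assigned to the `start_urls` class attribute, and the `parse` callback would then run once for each letter page.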

Answer 1
2 · Accepted · 2018-08-19 14:10:28
