Scraping all the links from a website using scrapy not working
I am trying to scrape all the links from a website that is also paginated. My scrapy code is given below, but it isn't working: it only scrapes the links from the first page. How can I scrape all the links? Thanks.
# -*- coding: utf-8 -*-
import scrapy

class DummySpider(scrapy.Spider):
    name = 'dummyspider'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/countrysearch/CN/China/products/A.html'
    ]

    def parse(self, response):
        link = response.xpath('//*[@class="column one3"]/a/@href').extract()
        for item in zip(link):
            scraped_info = {
                'link': item[0],
            }
            yield scraped_info

        next_page_url = response.xpath('//*[@class="page_btn"]/@href').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
The start URL is https://www.alibaba.com/countrysearch/CN/China/products/A.html
You can solve this by setting the start URLs correctly. The
string
module has an uppercase-letters constant:

>>> import string
>>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

which you can use to create the URLs programmatically:
import string
from scrapy import Spider

class MySpider(Spider):
    name = 'alibaba'
    start_urls = [
        f'http://foo.com?letter={char}'
        for char in string.ascii_uppercase
    ]
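For instance, applied to the question's own site, the 26 per-letter listing URLs can be built with a plain list comprehension; this sketch assumes the URL pattern from the question's start URL (`.../products/A.html`) holds for every letter, which has not been verified against the live site:

```python
import string

# Build one start URL per letter, following the pattern of the
# question's start URL: https://.../countrysearch/CN/China/products/A.html
# (assumed, not verified, to exist for B through Z as well).
start_urls = [
    f'https://www.alibaba.com/countrysearch/CN/China/products/{char}.html'
    for char in string.ascii_uppercase
]

print(len(start_urls))  # 26
print(start_urls[0])    # https://www.alibaba.com/countrysearch/CN/China/products/A.html
```

The existing `parse` method, including its next-page `scrapy.Request`, then runs once per letter, so pagination is followed within each letter's listing as well.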