
Scrapy: Get specific part of a URL before redirection

Here's the code I'll be working with (I'm using scrapy)

def start_requests(self):
    start_urls = ['https://www.lowes.com/search?searchTerm=8654RM-42']

This is where I'm storing all my URLs.


Here is how I'm trying to only print everything after the '='

            productSKU = response.url.split("=")[-1]
            item["productSKU"] = productSKU

Here is the output:

{'productPrice': '1,449.95',
 'productSKU': 'https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Ducted-Red-Matte-Wall-Mounted-Range-Hood-Common-42-Inch-Actual-42-in/1001440644'}

So now here's the problem:

The URLs I'm inputting will eventually all have the form

https://www.lowes.com/search?searchTerm={something}

and I would like to store {something} with each result, so that every item I attempted to scrape can be identified in the CSV (for sorting and matching purposes).

The URL I'm using redirects me to this URL:

(Input) https://www.lowes.com/search?searchTerm=8654RM-42

->

(Redirect) https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Ducted-Red-Matte-Wall-Mounted-Range-Hood-Common-42-Inch-Actual-42-in/1001440644

And so, my output for productSKU is the entire redirect URL instead of just whatever is after the '=' sign. The output I would like would be 8654RM-42.
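
To illustrate the symptom outside of Scrapy, here is a minimal sketch in plain Python. The helper name `last_part_after_equals` is made up for this illustration; the two URLs are the ones from above. The redirect URL contains no '=' at all, so `split("=")` returns a one-element list and `[-1]` is the whole URL.

```python
# Hypothetical helper mirroring the split used in the spider's parse() method.
def last_part_after_equals(url):
    return url.split("=")[-1]


input_url = "https://www.lowes.com/search?searchTerm=8654RM-42"
redirect_url = (
    "https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Ducted-Red-Matte-"
    "Wall-Mounted-Range-Hood-Common-42-Inch-Actual-42-in/1001440644"
)

# The input URL has an '=', so the split yields the search term.
print(last_part_after_equals(input_url))

# The redirect URL has no '=', so the "last part" is the entire string.
print(last_part_after_equals(redirect_url) == redirect_url)
```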

And here is my whole program

# -*- coding: utf-8 -*-
import scrapy
from ..items import LowesspiderItem
from scrapy.http import Request


class LowesSpider(scrapy.Spider):
    name = 'lowes'

    def start_requests(self):
        start_urls = ['https://www.lowes.com/search?searchTerm=8654RM-42']

        for url in start_urls:
            # Added cookie to bypass location req
            yield Request(url, cookies={'sn': '2333'})

    def parse(self, response):
        items = response.css('.grid-container')
        for product in items:
            item = LowesspiderItem()

            # get product price
            productPrice = product.css('.art-pd-price::text').get()
            productSKU = response.url.split("=")[-1]

            item["productSKU"] = productSKU
            item["productPrice"] = productPrice

            yield item

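By the time parse runs, response.url is the URL after the redirect, so the search term is no longer in it. You need to use meta to pass the input URL along with the request, like this: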

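def start_requests(self):
    start_urls = ['https://www.lowes.com/search?searchTerm=8654RM-42']

    for url in start_urls:
        # Attach the input URL to the request so it is still available after the redirect
        yield Request(url, cookies={'sn': '2333'}, meta={'url': url})

def parse(self, response):
    url = response.meta['url']  # your input URL, not the redirect target
    productSKU = url.split("=")[-1]  # '8654RM-42' for the URL above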
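
Once you have the input URL back, splitting on "=" works, but parsing the query string is a bit more robust if the URL ever gains extra parameters. Below is a minimal sketch using only the standard library. The helper names `search_term_from_url` and `original_url` are made up for this example, and `meta` is assumed to be a plain dict shaped like Scrapy's `response.meta`. As far as I know, Scrapy's RedirectMiddleware also records the URLs a request passed through under the `redirect_urls` meta key, so the sketch falls back to that when no `url` key was set by hand.

```python
from urllib.parse import urlparse, parse_qs


def search_term_from_url(url, param="searchTerm"):
    """Return the first value of `param` in the URL's query string, or None."""
    query = parse_qs(urlparse(url).query)
    values = query.get(param)
    return values[0] if values else None


def original_url(meta, fallback_url):
    """Pick the URL that was requested first.

    Assumption: `meta` looks like Scrapy's response.meta, i.e. it may hold
    the 'url' key set in start_requests, or the 'redirect_urls' list filled
    in by the redirect middleware. If neither is present, no redirect
    information is available and `fallback_url` is returned.
    """
    if "url" in meta:
        return meta["url"]
    redirects = meta.get("redirect_urls")
    return redirects[0] if redirects else fallback_url


# Stand-in for response.meta inside parse(); the values come from the question.
meta = {"url": "https://www.lowes.com/search?searchTerm=8654RM-42"}
final_url = (
    "https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Ducted-Red-Matte-"
    "Wall-Mounted-Range-Hood-Common-42-Inch-Actual-42-in/1001440644"
)

product_sku = search_term_from_url(original_url(meta, final_url))
print(product_sku)
```

Inside the spider this would amount to something like `item["productSKU"] = search_term_from_url(original_url(response.meta, response.url))`. If a URL has no `searchTerm` parameter, the helper returns None rather than the whole URL, which makes missing terms easier to spot in the CSV.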


 