How do I remove extra characters or symbols from a Scrapy start URL?

Question

I have a Scrapy spider, and when I run it I get this error:

Ignoring response <302 https://www.macys.com/ >: HTTP status code is not handled or not allowed

Here is my spider:

import scrapy
import urllib.parse
import random

class MacysspiderSpider(scrapy.Spider):
    name = 'macysSpider'
    allowed_domains = ['macys.com']
    start_urls = ['https://macys.com']

    def parse(self, response):
        pass

I inspected the URL, and when I run the code a ">" is included at the end of the URL:

https://www.macys.com/ >

How can I remove this character from the start URL?

Answer 1

I'm not sure where you found the '>' as part of the URL, but I don't think it has anything to do with the problem. You need to set some headers to scrape this website:

headers = {
    'authority': 'www.macys.com',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,nl-BE;q=0.8,nl;q=0.7,ro-RO;q=0.6,ro;q=0.5,en-US;q=0.4',
}

To apply these headers to your first request, you can override the start_requests method as follows:

def start_requests(self):
    # Requires `import scrapy`; `headers` must be defined as a class attribute on the spider
    for url in self.start_urls:
        yield scrapy.Request(url, headers=self.headers)

Answer 1 | 3 | Accepted | 2019-07-18 07:42:41
