
Python Scrapy 301 redirects

I have a small problem printing the redirected URLs (the new URLs after a 301 redirect) when scraping a given website. I only want to print them, not scrape them. My current code is:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'rust'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    rules = (
        # Extract every link, follow it, and pass each response
        # to the spider's parse_item method.
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # if response.status == 301:
        print(response.url)

However, this does not print the redirected URLs. Any help would be appreciated.

Thank you.

To handle any response whose status is outside the 200 range in your callback, you need to do one of the following:

Project-wide

Set HTTPERROR_ALLOWED_CODES = [301, 302, ...] in your settings.py file. To allow all status codes instead, set HTTPERROR_ALLOW_ALL = True.
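As a rough sketch, the project-wide variant could look like the settings.py fragment below. The REDIRECT_ENABLED line is not part of the answer above; it is added on the assumption that, with default settings, Scrapy's RedirectMiddleware follows 301/302 responses before any callback sees them, so treat it as optional and check it against your Scrapy version.

```python
# settings.py (fragment)

# Let these non-2xx status codes through HttpErrorMiddleware to spider callbacks.
HTTPERROR_ALLOWED_CODES = [301, 302]

# Alternative: allow every status code instead of listing them.
# HTTPERROR_ALLOW_ALL = True

# Assumption, not from the answer above: stop RedirectMiddleware from
# following redirects so the 301/302 response itself reaches the callback.
REDIRECT_ENABLED = False
```

Because these are project settings, they apply to every spider in the project; the spider-wide and request-wide options are narrower.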

Spider-wide

Add a handle_httpstatus_list attribute to your spider. In your case, something like:

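class MySpider(scrapy.Spider):
    # Let 301 responses reach the spider's callbacks.
    handle_httpstatus_list = [301]
    # To allow every status code, use the HTTPERROR_ALLOW_ALL setting or the
    # handle_httpstatus_all request meta key; it is not read as a spider attribute.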

Request-wide

You can set these meta keys on a request: handle_httpstatus_list = [301, 302, ...], or handle_httpstatus_all = True to allow all codes:

scrapy.Request('http://url.com', meta={'handle_httpstatus_list': [301]})

To learn more, see HttpErrorMiddleware.
