
Scrapy: preventing CrawlSpider from crawling LinkedIn/Facebook websites

Is there any way I can control my CrawlSpider so that it doesn't crawl outside of the original domain I specified in the start_urls list? I tried the code below, but it didn't work for me:

import os
from scrapy.selector import Selector
from scrapy.contrib.exporter import CsvItemExporter
from scrapy.item import Item, Field
from scrapy.settings import Settings
from scrapy.settings import default_settings
from selenium import webdriver
from urlparse import urlparse
import csv
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log

default_settings.DEPTH_LIMIT = 3
DOWNLOADER_MIDDLEWARES = {
    'grimes2.middlewares.CustomDownloaderMiddleware': 543,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
}

Can someone help me? Thank you.

Set the allowed_domains attribute on your spider. From the Scrapy docs: "An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won't be followed if OffsiteMiddleware is enabled."

See how it's used in the Scrapy tutorial:

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]  # OffsiteMiddleware drops requests to any other domain
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
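
For a CrawlSpider like yours, allowed_domains works the same way, and you can additionally pass allow_domains to the link extractor so offsite links are never extracted in the first place. Here is a minimal sketch using the same scrapy.contrib imports as your code; the spider name, example.com domain, and parse_item callback are placeholders for your own values:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = "mysite"
    # OffsiteMiddleware (enabled by default) drops requests for any
    # domain not listed here, e.g. links to linkedin.com or facebook.com.
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        # allow_domains keeps this rule from extracting offsite links at all.
        Rule(SgmlLinkExtractor(allow_domains=["example.com"]),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        self.log("Crawled %s" % response.url)

Note also that settings such as DEPTH_LIMIT and DOWNLOADER_MIDDLEWARES normally belong in your project's settings.py; assigning them at module level in the spider file, as in your snippet, is not the supported way to configure them.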
