
What's the easiest way to quickly check a Scrapy behaviour / bug?

I sometimes try to solve Scrapy problems on Stack Overflow, but I usually don't test my ideas, because I don't know how to do that quickly without setting up a whole Scrapy project and parsing a real web page.

What's the quickest way to check problems / solutions against an offline example file, without having to create a whole new Scrapy project?

Running a spider from a single file

If your spider doesn't depend on pipelines or any of the other components of a regular Scrapy project, one idea is to put it in a self-contained file and run it with the command:

scrapy runspider file_with_my_spider.py

Scrapy will look for a spider class in the file (a class extending scrapy.Spider or one of its subclasses, such as scrapy.spiders.CrawlSpider) and run it. Keeping a single spider in the file avoids any ambiguity about which one is picked.

If you are trying to isolate a spider that originally lives inside a Scrapy project, you will also have to copy the item classes and any other dependencies into this single file.

Running a spider against a test site

For offline testing, you can replicate the site structure by putting the HTML pages in a directory and then running python -m http.server in it (python -m SimpleHTTPServer on Python 2). This starts a local server on http://localhost:8000/ that you can run the spider against.
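If you would rather start the local server from a script (for example from a small test harness), something along these lines should work. It is only a sketch that assumes Python 3.7 or later, and serve_directory is a hypothetical helper name, not part of Scrapy or the standard library:

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def serve_directory(directory, port=8000):
    """Serve `directory` over HTTP on 127.0.0.1:<port> in a background thread."""
    # Bind the handler to the directory holding the saved HTML pages.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    # A daemon thread keeps the server from blocking interpreter exit.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Call server.shutdown() and server.server_close() on the returned object when you are done. Passing port=0 lets the operating system pick a free port, which you can then read from server.server_address[1].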

To switch easily between the local server and the real site, you can make your spider look like this:

import scrapy

class MySpider(scrapy.Spider):
    name = 'my-spider'
    start_urls = ['http://www.some-real-site-url.com']

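    def __init__(self, start_url=None, *args, **kwargs):
        # Let scrapy.Spider finish its own setup (name, other -a arguments).
        super().__init__(*args, **kwargs)
        # Override the default start URL when -a start_url=... is given.
        if start_url:
            self.start_urls = [start_url]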

    ...

With this in your spider, you can run:

scrapy runspider file_with_my_spider.py -a start_url=http://localhost:8000/

to run the spider against the site served by the local server.


 