
How to schedule spider to run every 5 minutes?

I've been trying to figure out how to schedule my scrapy spider for days now without any luck. (I tried everything from Windows Task Scheduler to the scrapy-do library, but nothing has worked with my MAIN.PY.)

(My main goal is to schedule my spider NewsSpider to gather data into the MySQL news_db database every 5 minutes.)

Please look at my script, as it's a bit modified, and change it if needed. I really want this to work.

MAIN.PY

from scrapy import cmdline

# Runs a single crawl; note that cmdline.execute() exits the process when the crawl finishes
cmdline.execute("scrapy crawl news".split())

NEWS_SPIDER.PY

import scrapy
from ..items import WebspiderItem


class NewsSpider(scrapy.Spider):
    name = 'news'
    start_urls = [
        'https://www.coindesk.com/feed'
    ]

    def parse(self, response):
        # The feed is RSS/XML; the hard-coded indices skip the channel-level
        # <title>/<link>/<description> elements to reach the first item's fields
        pub_date = response.xpath('//pubDate/text()').extract()[0]
        page_title = response.xpath('//title/text()').extract()[2]
        page_summary = response.xpath('//description/text()').extract()[1]
        text_link = response.xpath('//link/text()').extract()[2]

        item = WebspiderItem()
        item['date'] = pub_date
        item['title'] = page_title
        item['summary'] = page_summary
        item['link'] = text_link

        yield item

ITEMS.PY

import scrapy


class WebspiderItem(scrapy.Item):
    # define the fields for your item here like:
    date = scrapy.Field()
    title = scrapy.Field()
    summary = scrapy.Field()
    link = scrapy.Field()

PIPELINES.PY

import mysql.connector


class WebspiderPipeline(object):

    def __init__(self):
        self.create_connection()

    def create_connection(self):
        self.conn = mysql.connector.connect(
            host='localhost',
            user='root',
            passwd='passordpassord',
            database='news_db'
        )
        self.curr = self.conn.cursor()

    def process_item(self, item, spider):
        self.store_db(item)
        return item

    def store_db(self, item):
        # NOTE: relies on news_tb's column order matching (date, title, summary, link)
        self.curr.execute("""insert into news_tb values (%s, %s, %s, %s)""", (
            item['date'],
            item['title'],
            item['summary'],
            item['link']
        ))
        self.conn.commit()
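
As a side note on the pipeline above: the INSERT depends on the column order of news_tb matching the value tuple. Below is a minimal sketch of a more explicit version, assuming the table's columns are named date, title, summary and link (adjust to the actual schema); it also closes the database connection via Scrapy's close_spider hook when the crawl ends:

import mysql.connector


class WebspiderPipeline(object):

    def __init__(self):
        self.conn = mysql.connector.connect(
            host='localhost',
            user='root',
            passwd='passordpassord',
            database='news_db'
        )
        self.curr = self.conn.cursor()

    def process_item(self, item, spider):
        # Naming the columns keeps the insert working even if news_tb
        # gains columns or its order changes (column names are assumed)
        self.curr.execute(
            "insert into news_tb (date, title, summary, link) values (%s, %s, %s, %s)",
            (item['date'], item['title'], item['summary'], item['link'])
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes; release the connection
        self.curr.close()
        self.conn.close()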

Using the schedule package worked for me, both on Windows locally and on my Linux server. Simply install it with pip install schedule. Then set up a new job by pasting the following into the main.py file:

import schedule
import time
import os

print('Scheduler initialised')
schedule.every(5).minutes.do(lambda: os.system('scrapy crawl news'))
print('Next job is set to run at: ' + str(schedule.next_run()))

while True:
    schedule.run_pending()
    time.sleep(1)

Then run python main.py in the terminal. The script will run the scrapy crawl news command every 5 minutes as long as you don't close the terminal.

Please note that it's quite important to use os.system() rather than cmdline.execute() here, since cmdline.execute() calls sys.exit() once the crawl completes, which takes the scheduler's infinite while loop down with it. os.system() runs the crawl in a separate process, so the loop survives and the next job fires after another 5 minutes have passed.
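
If you ever need more control over the crawl than os.system() offers (for instance, to check the exit code), the standard library's subprocess.run behaves the same way for scheduling purposes: each crawl runs in its own child process, so the scheduler loop survives it. A minimal sketch of the same main.py using subprocess:

import schedule
import subprocess
import time

def run_spider():
    # The crawl runs in a separate child process; when it exits,
    # control returns here and the while loop below keeps ticking
    result = subprocess.run(['scrapy', 'crawl', 'news'])
    print('Crawl finished with exit code ' + str(result.returncode))

schedule.every(5).minutes.do(run_spider)

while True:
    schedule.run_pending()
    time.sleep(1)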
