I've been trying to figure out how to schedule my Scrapy spider for days now without any luck. (I tried everything from Windows Task Scheduler to the scrapy-do library, but nothing has worked with my MAIN.PY.)
(My goal is to schedule my spider NewsSpider to gather data into the MySQL news_db database every 5 minutes.)
Please look at my script, as it's a bit modified, and change it if needed. I really want this to work.
MAIN.PY
from scrapy import cmdline
cmdline.execute("scrapy crawl news".split())
NEWS_SPIDER.PY
import scrapy
from ..items import WebspiderItem


class NewsSpider(scrapy.Spider):
    name = 'news'
    start_urls = [
        'https://www.coindesk.com/feed'
    ]

    def parse(self, response):
        pub_date = response.xpath('//pubDate/text()').extract()[0]
        page_title = response.xpath('//title/text()').extract()[2]
        page_summary = response.xpath('//description/text()').extract()[1]
        text_link = response.xpath('//link/text()').extract()[2]

        item = WebspiderItem()
        item['date'] = pub_date
        item['title'] = page_title
        item['summary'] = page_summary
        item['link'] = text_link
        yield item
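A side note on the parse() method: picking fields out of separate flat lists by index (extract()[2], extract()[1]) pairs titles with the wrong dates as soon as the feed order shifts, and yields only a single item per crawl. A more robust pattern is to loop over each <item> element and read its children relative to that node — in the spider that would be `for entry in response.xpath('//item'):` with relative XPaths like `entry.xpath('title/text()')`. Here is the same loop idea as a standalone sketch using only the standard library (the feed XML below is made-up sample data, not the real CoinDesk feed):

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS sample standing in for the real feed.
RSS = """<rss><channel>
  <item>
    <title>First story</title>
    <link>https://example.com/1</link>
    <description>Summary one</description>
    <pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate>
  </item>
  <item>
    <title>Second story</title>
    <link>https://example.com/2</link>
    <description>Summary two</description>
    <pubDate>Tue, 02 Jan 2024 00:00:00 +0000</pubDate>
  </item>
</channel></rss>"""

def parse_feed(xml_text):
    """Yield one dict per <item>, keeping each item's fields together."""
    root = ET.fromstring(xml_text)
    for entry in root.iter('item'):
        yield {
            'date': entry.findtext('pubDate'),
            'title': entry.findtext('title'),
            'summary': entry.findtext('description'),
            'link': entry.findtext('link'),
        }

items = list(parse_feed(RSS))
```

This way every field for one article comes from the same <item> node, and every article in the feed becomes its own item.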
ITEMS.PY
import scrapy


class WebspiderItem(scrapy.Item):
    # define the fields for your item here like:
    date = scrapy.Field()
    title = scrapy.Field()
    summary = scrapy.Field()
    link = scrapy.Field()
PIPELINES.PY
import mysql.connector


class WebspiderPipeline(object):
    def __init__(self):
        self.create_connection()

    def create_connection(self):
        self.conn = mysql.connector.connect(
            host='localhost',
            user='root',
            passwd='passordpassord',
            database='news_db'
        )
        self.curr = self.conn.cursor()

    def process_item(self, item, spider):
        self.store_db(item)
        return item

    def store_db(self, item):
        self.curr.execute("""insert into news_tb values (%s, %s, %s, %s)""", (
            item['date'],
            item['title'],
            item['summary'],
            item['link']
        ))
        self.conn.commit()
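The INSERT in store_db() assumes a news_tb table with exactly four columns in this order already exists in news_db. If it might not, one option is to create it once when the connection is opened; the column names and types below are assumptions on my part, not taken from the question:

```python
# Possible addition to create_connection(): create the table if it is missing.
# Column names/types are guesses matching the four fields the pipeline inserts.
CREATE_NEWS_TB = """
CREATE TABLE IF NOT EXISTS news_tb (
    date    VARCHAR(64),
    title   VARCHAR(255),
    summary TEXT,
    link    VARCHAR(512)
)
"""
# In create_connection(), after the cursor is created:
# self.curr.execute(CREATE_NEWS_TB)
```

With the table guaranteed to exist, a failing INSERT is more likely to point at a real data problem than at a missing schema.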
Using the schedule package worked for me on both Windows locally and on my Linux server. Simply install it with pip install schedule. Then set up a new job by pasting the following into the main.py file:
import schedule
import time
import os

print('Scheduler initialised')

schedule.every(5).minutes.do(lambda: os.system('scrapy crawl news'))

print('Next job is set to run at: ' + str(schedule.next_run()))

while True:
    schedule.run_pending()
    time.sleep(1)
Then run python main.py in the terminal. The script will run the scrapy crawl news command every 5 minutes for as long as you don't close the terminal.
Please note that it's quite important to use os.system() rather than cmdline.execute(), since, as far as I recall, cmdline.execute() exits the scheduler's infinite while loop when the job is completed. os.system() doesn't do this, so control returns to the loop, which waits another 5 minutes before running the next job.
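An alternative to os.system() with the same isolation property is subprocess.run(), which also gives you the child's return code so you can log failed crawls. This is my own suggestion, not part of the original answer, and the helper name run_crawl is made up:

```python
import subprocess

def run_crawl(command=("scrapy", "crawl", "news")):
    # The crawl runs in a separate child process, so nothing it does
    # (including calling sys.exit()) can break out of the scheduler loop.
    completed = subprocess.run(list(command))
    return completed.returncode  # 0 means the command exited cleanly

# Hooked into the scheduler instead of the os.system() lambda:
# schedule.every(5).minutes.do(run_crawl)
```

Since subprocess.run() waits for the child to finish and hands back a CompletedProcess, you could extend run_crawl to alert you whenever the returncode is non-zero.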