简体   繁体   中英

Does APScheduler Need a Function to Run?

I'm new to coding and this my first project. So far I've pieced together what I have through Googling, Tutorials and Stack.

I'm trying to add data from a pandas df of scraped RSS feeds to a remote sql database, then host the script on heroku or AWS and have the script running every hour.

Someone on here recommend that I use APScheduler as in this post .

I'm struggling though as there aren't any 'dummies' tutorials around APScheduler. This is what I've created so far.

I guess my question is does my script need to be in a function for APScheduler to trigger it or can it work another way.

from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

@sched.scheduled_job('interval', minutes=1)

sched.configure()
sched.start()

import pandas as pd
from pandas.io import sql
import feedparser
import time


rawrss = ['http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
          'https://www.yahoo.com/news/rss/',
          'http://www.huffingtonpost.co.uk/feeds/index.xml',
          'http://feeds.feedburner.com/TechCrunch/',
          'https://www.uktech.news/feed'
         ]

time = time.strftime('%a %H:%M:%S')
summary = 'text'

posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((time, post.title, post.link, summary))

df = pd.DataFrame(posts, columns=['article_time','article_title','article_url', 'article_summary']) # pass data to init
df.set_index(['article_time'], inplace=True)

import pymysql
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://<username>:<host>:3306/<database_name>?charset=utf8', encoding = 'utf-8')
engine.execute("INSERT INTO rsstracker VALUES('%s', '%s', '%s','%s')" % (time, post.title, post.link, summary))

df.to_sql(con=engine, name='rsstracker', if_exists='append') #, flavor='mysql'

Yes. What you want to be executed must be a function (or another callable, like a method). The decorator syntax ( @sched.… ) needs a function definition ( def … ) to which the decorator is applied. The code in your example doesn't compile.

Then it's a blocking scheduler, meaning if you call sched.start() this method doesn't return (unless you stop the scheduler in some scheduled code) and nothing after the call is executed.

Imports should go to the top, then it's easier to see what the module depends on. And don' import things you don't actually use.

I'm not sure why you import and use pandas for data that doesn't really need DataFrame objects. Also SQLAlchemy without actually using anything this library offers and formatting values as strings into an SQL query which is dangerous !

Just using SQLAlchemy for the database it may look like this:

#!/usr/bin/env python
# coding: utf-8
from __future__ import absolute_import, division, print_function
from time import strftime

import feedparser
from apscheduler.schedulers.blocking import BlockingScheduler
from sqlalchemy import create_engine, MetaData

sched = BlockingScheduler()


RSS_URLS = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
    'https://www.uktech.news/feed',
]


@sched.scheduled_job('interval', minutes=1)
def process_feeds():
    time = strftime('%a %H:%M:%S')
    summary = 'text'

    engine = create_engine(
        'mysql+pymysql://<username>:<host>:3306/<database_name>?charset=utf8'
    )
    metadata = MetaData(engine, reflect=True)
    rsstracker = metadata.tables['rsstracker']

    for url in RSS_URLS:
        feed = feedparser.parse(url)
        for post in feed.entries:
            (
                rsstracker.insert()
                    .values(
                        time=time,
                        title=post.title,
                        url=post.link,
                        summary=summary,
                    )
                    .execute()
            )


def main():
    sched.configure()
    sched.start()


if __name__ == '__main__':
    main()

The time column seems a bit odd, I would have expected a TIMESTAMP or DATETIME here and not a string that throws away much of the information, just leaving the abbreviated week day and the time.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM