
How to chain Celery tasks

I want to chain Celery tasks in a standard way.

I have a JSON file. Inside that file there are many hardcoded URLs. I need to scrape those links, plus scrape the links that are found while scraping them.

Currently, I'm doing it like this:

for each_news_source, news_categories in rss_obj.items():
    for each_category in news_categories:
        category = each_category['category']
        rss_link = each_category['feed']
        json_id = each_category['json']
        try:
            # getrsslinks runs synchronously here; only scrape_link is a Celery task
            list_of_links = getrsslinks(rss_link)
            for link in list_of_links:
                scrape_link.delay(link, json_id, category)
        except Exception as e:
            print("Invalid url", str(e))

I want getrsslinks to also be a Celery task, and then the scraping of the list of URLs returned by getrsslinks should be another Celery task.

It follows this pattern:

hardcodedJSONURL1--
                 --`getrsslinks` (celery task)
                                 --scrape link 1 (celery task)
                                 --scrape link 2 (celery task)
                                 --scrape link 3 (celery task)
                                 --scrape link 4 (celery task)

hardcodedJSONURL2--
                 --`getrsslinks` (celery task)
                                 --scrape link 1 (celery task)
                                 --scrape link 2 (celery task)
                                 --scrape link 3 (celery task)
                                 --scrape link 4 (celery task)

and so on...

How can I do this?

Take a look at the subtask options in Celery. In your case, groups should help. You just need to call a scrape_link group inside getrsslinks.

from celery import group

@app.task
def getrsslinks(rsslink, json_id, category):
    # do processing: parse rsslink and build link_list (the URLs found in the feed)

    # Launch one scrape_link task per URL as a group
    scrape_jobs = group(scrape_link.s(link, json_id, category) for link in link_list)
    scrape_jobs.apply_async()
    ...

You might want getrsslinks to return scrape_jobs so that monitoring the jobs is easier (a sketch of one way to do this follows the loop below). Then, when parsing your JSON file, you would call getrsslinks like so:

for each_news_source, news_categories in rss_obj.items():
    for each_category in news_categories:
        category = each_category['category']
        rss_link = each_category['feed']
        json_id = each_category['json']
        getrsslinks.delay(rss_link, json_id, category)
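
As noted above, you might want getrsslinks to return something you can monitor. Result objects don't serialize well as task return values, so one hedged workaround is to save the GroupResult and return only its id. The sketch below assumes a result backend is configured (for example Redis), and parse_feed is a hypothetical helper standing in for the feed-parsing step; GroupResult.save() and GroupResult.restore() are standard Celery APIs, though details vary across versions.

from celery import group
from celery.result import GroupResult

@app.task
def getrsslinks(rsslink, json_id, category):
    # Hypothetical helper that downloads the feed and extracts its links
    link_list = parse_feed(rsslink)

    # Fan out one scrape_link task per discovered URL
    scrape_jobs = group(scrape_link.s(link, json_id, category) for link in link_list)
    result = scrape_jobs.apply_async()
    result.save()      # persist the group result (requires a result backend)
    return result.id   # a plain string id is safe to return from a task

# Later, from monitoring code:
# saved = GroupResult.restore(group_id)
# saved.completed_count(), saved.successful(), saved.failed()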

Finally, to monitor which links were invalid (since we replaced the try/except block), you need to keep track of all the getrsslinks tasks and watch for success or failure. You could use apply_async with link_error for this.
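
For example, here is a minimal sketch of an error callback, assuming Celery 4+ where errbacks receive the request, the exception and the traceback (older versions passed only the task id); log_rss_error is a hypothetical name:

@app.task
def log_rss_error(request, exc, traceback):
    # Called when the linked getrsslinks task raises; request.id identifies it
    print('getrsslinks task {0} failed: {1!r}'.format(request.id, exc))

# Attach the errback when dispatching, replacing the old try/except:
getrsslinks.apply_async((rss_link, json_id, category),
                        link_error=log_rss_error.s())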
