
How to chain Celery tasks

I want to chain Celery tasks in a standard way.

I have a JSON file. Inside that file there are many hardcoded URLs. I need to scrape those links, plus scrape the links that are found while scraping them.

Currently, I'm doing it like this:

for each_news_source, news_categories in rss_obj.items():
    for each_category in news_categories:
        category = each_category['category']
        rss_link = each_category['feed']
        json_id = each_category['json']
        try:
            # getrsslinks runs synchronously here; only scrape_link is a Celery task
            list_of_links = getrsslinks(rss_link)
            for link in list_of_links:
                scrape_link.delay(link, json_id, category)
        except Exception as e:
            print("Invalid url", str(e))

I want getrsslinks to also be a Celery task, and then the scraping of the list of URLs returned by getrsslinks should be another Celery task.

It follows this pattern:

hardcodedJSONURL1--
                 --`getrsslinks` (celery task)
                                 --scrape link 1 (celery task)
                                 --scrape link 2 (celery task)
                                 --scrape link 3 (celery task)
                                 --scrape link 4 (celery task)

hardcodedJSONURL2--
                 --`getrsslinks` (celery task)
                                 --scrape link 1 (celery task)
                                 --scrape link 2 (celery task)
                                 --scrape link 3 (celery task)
                                 --scrape link 4 (celery task)

and so on...

How can I do this?

Take a look at the subtask options in Celery. In your case, groups should help. You just need to call a scrape_link group inside getrsslinks.

from celery import group

@app.task
def getrsslinks(rsslink, json_id, category):
    # do processing: parse rsslink and build link_list (the URLs found in the feed)

    # Launch one scrape_link task per URL as a group
    scrape_jobs = group(scrape_link.s(link, json_id, category) for link in link_list)
    scrape_jobs.apply_async()
    ...

You might want getrsslinks to return scrape_jobs so that monitoring the jobs is easier (a sketch of one way to do this follows the loop below). Then, when parsing your JSON file, you would call getrsslinks like so:

for each_news_source, news_categories in rss_obj.items():
    for each_category in news_categories:
        category = each_category['category']
        rss_link = each_category['feed']
        json_id = each_category['json']
        getrsslinks.delay(rss_link, json_id, category)
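
As noted above, you might want getrsslinks to return something you can monitor. Result objects don't serialize well as task return values, so one hedged workaround is to save the GroupResult and return only its id. The sketch below assumes a result backend is configured (for example Redis), and parse_feed is a hypothetical helper standing in for the feed-parsing step; GroupResult.save() and GroupResult.restore() are standard Celery APIs, though details vary across versions.

from celery import group
from celery.result import GroupResult

@app.task
def getrsslinks(rsslink, json_id, category):
    # Hypothetical helper that downloads the feed and extracts its links
    link_list = parse_feed(rsslink)

    # Fan out one scrape_link task per discovered URL
    scrape_jobs = group(scrape_link.s(link, json_id, category) for link in link_list)
    result = scrape_jobs.apply_async()
    result.save()      # persist the group result (requires a result backend)
    return result.id   # a plain string id is safe to return from a task

# Later, from monitoring code:
# saved = GroupResult.restore(group_id)
# saved.completed_count(), saved.successful(), saved.failed()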

Finally, to monitor which links were invalid (since we replaced the try/except block), you need to keep track of all the getrsslinks tasks and watch for success or failure. You could use apply_async with link_error for this.
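
For example, here is a minimal sketch of an error callback, assuming Celery 4+ where errbacks receive the request, the exception and the traceback (older versions passed only the task id); log_rss_error is a hypothetical name:

@app.task
def log_rss_error(request, exc, traceback):
    # Called when the linked getrsslinks task raises; request.id identifies it
    print('getrsslinks task {0} failed: {1!r}'.format(request.id, exc))

# Attach the errback when dispatching, replacing the old try/except:
getrsslinks.apply_async((rss_link, json_id, category),
                        link_error=log_rss_error.s())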
