Scraping multiple news article sources into one single list with NewsPaper library in Python?

Dear Stackoverflow community!

This is a follow-up to a previous question I posted here.

I would like to extract news article URLs with the NewsPaper library from MULTIPLE sources into one SINGLE list. This worked well for one source, but as soon as I add a second source link, only the URLs of the second one are extracted.

    import feedparser as fp
    import newspaper
    from newspaper import Article

    website = {"cnn": {"link": "edition.cnn.com", "rss": "rss.cnn.com/rss/cnn_topstories.rss"}, "cnbc":{"link": "cnbc.com", "rss": "cnbc.com/id/10000664/device/rss/rss.html"}}

    for source, value in website.items():
        if 'rss' in value:
            d = fp.parse(value['rss']) 
            #if there is an RSS value for a company, it will be extracted into d
            article_list = []

            for entry in d.entries:
                if hasattr(entry, 'published'):
                    article = {}
                    article['link'] = entry.link
                    article_list.append(article['link'])
                    print(article['link'])

The output is as follows; only the links from the second source are appended:

    ['https://www.cnbc.com/2019/10/23/why-china-isnt-cutting-lending-rates-like-the-rest-of-the-world.html', 'https://www.cnbc.com/2019/10/22/stocks-making-the-biggest-moves-after-hours-snap-texas-instruments-chipotle-and-more.html' , ...]

I would like all the URLs from both sources to be extracted into the list. Does anyone know a solution to this problem? Thank you very much in advance!!

article_list is being overwritten inside your first for loop. Each time you iterate over a new source, article_list is set to a new empty list, which discards everything collected from the previous source. That is why you end up with links from only one source, the last one.
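The effect can be reproduced without any network access. The sketch below uses stand-in data (`fake_feeds` and its URLs are made up for illustration, not real feed contents) and resets the list inside the loop, as the code in the question does:

```python
# Hypothetical stand-in data: two sources with made-up links.
fake_feeds = {
    "cnn": ["https://example.com/cnn/1", "https://example.com/cnn/2"],
    "cnbc": ["https://example.com/cnbc/1"],
}

for source, links in fake_feeds.items():
    article_list = []  # reset on every pass, as in the question
    for link in links:
        article_list.append(link)

# Only the links of the last source processed are left.
print(article_list)  # ['https://example.com/cnbc/1']
```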

You should initialize article_list at the beginning and not overwrite it.

    import feedparser as fp
    import newspaper
    from newspaper import Article

    website = {"cnn": {"link": "edition.cnn.com", "rss": "rss.cnn.com/rss/cnn_topstories.rss"}, "cnbc":{"link": "cnbc.com", "rss": "cnbc.com/id/10000664/device/rss/rss.html"}}

    article_list = []  # initialize ONCE, before the loop over sources
    for source, value in website.items():
        if 'rss' in value:
            # if there is an RSS value for a company, it will be extracted into d
            d = fp.parse(value['rss'])
            # article_list = []  <- this is where it was being overwritten

            for entry in d.entries:
                if hasattr(entry, 'published'):
                    article = {}
                    article['link'] = entry.link
                    article_list.append(article['link'])
                    print(article['link'])
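
If the combined list is needed in more than one place, the same loop could be wrapped in a function. This is only a sketch: `collect_links` and its `parse` parameter are names made up here, and the demo passes a fake parser with made-up entries so that it runs offline. With feedparser installed, passing `fp.parse` instead should behave like the loop above.

```python
from types import SimpleNamespace


def collect_links(sources, parse):
    """Return the links of all published entries from every source that has an 'rss' key."""
    links = []  # created once, shared by all sources
    for name, value in sources.items():
        if "rss" not in value:
            continue
        feed = parse(value["rss"])
        for entry in feed.entries:
            if hasattr(entry, "published"):
                links.append(entry.link)
    return links


# Hypothetical offline stand-in for feedparser.parse, with made-up entries.
def fake_parse(url):
    entries = [
        SimpleNamespace(link=f"https://example.com/{url}/1", published="2019-10-23"),
        SimpleNamespace(link=f"https://example.com/{url}/2"),  # no 'published', so skipped
    ]
    return SimpleNamespace(entries=entries)


sources = {"a": {"rss": "feed-a"}, "b": {"rss": "feed-b"}, "c": {"link": "no-feed"}}
all_links = collect_links(sources, fake_parse)
print(all_links)
```

Returning the list from a function also avoids relying on a module-level variable, so the accumulator cannot be reset by accident inside the loop.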
