Scraping multiple news article sources into one single list with NewsPaper library in Python?
Dear Stackoverflow community!
This is a follow-up question regarding a previous question I posted here.
I would like to extract newspaper URLs with the NewsPaper library from MULTIPLE sources into one SINGLE list. This worked well for one source, but as soon as I add a second source link, it extracts only the URLs of the second one.
import feedparser as fp
import newspaper
from newspaper import Article

website = {"cnn": {"link": "edition.cnn.com", "rss": "rss.cnn.com/rss/cnn_topstories.rss"},
           "cnbc": {"link": "cnbc.com", "rss": "cnbc.com/id/10000664/device/rss/rss.html"}}

for source, value in website.items():
    if 'rss' in value:
        # if there is an RSS value for a company, it will be parsed into d
        d = fp.parse(value['rss'])
        article_list = []
        for entry in d.entries:
            if hasattr(entry, 'published'):
                article = {}
                article['link'] = entry.link
                article_list.append(article['link'])
                print(article['link'])
The output is as follows; only the links from the second source are appended:
['https://www.cnbc.com/2019/10/23/why-china-isnt-cutting-lending-rates-like-the-rest-of-the-world.html', 'https://www.cnbc.com/2019/10/22/stocks-making-the-biggest-moves-after-hours-snap-texas-instruments-chipotle-and-more.html', ...]
I would like all the URLs from both sources to be extracted into the list. Does anyone know a solution to this problem? Thank you very much in advance!
article_list is being overwritten in your first for loop. Each time you iterate over a new source, article_list is set to a new empty list, effectively losing all information from the previous source. That's why at the end you only have information from one source, the last one.
You should initialize article_list at the beginning and not overwrite it.
import feedparser as fp
import newspaper
from newspaper import Article

website = {"cnn": {"link": "edition.cnn.com", "rss": "rss.cnn.com/rss/cnn_topstories.rss"},
           "cnbc": {"link": "cnbc.com", "rss": "cnbc.com/id/10000664/device/rss/rss.html"}}

article_list = []  # INIT ONCE
for source, value in website.items():
    if 'rss' in value:
        # if there is an RSS value for a company, it will be parsed into d
        d = fp.parse(value['rss'])
        # article_list = []  <-- THIS IS WHERE IT WAS BEING OVERWRITTEN
        for entry in d.entries:
            if hasattr(entry, 'published'):
                article = {}
                article['link'] = entry.link
                article_list.append(article['link'])
                print(article['link'])
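The effect of the fix can be demonstrated without any network access or RSS parsing. The sketch below uses a hard-coded dict of fake entry links as a stand-in for the parsed feeds, and contrasts re-initializing the accumulator inside the loop (the bug) with initializing it once before the loop:

```python
# Stand-in for the entries returned by fp.parse(value['rss']) per source.
feeds = {
    "cnn": ["cnn-article-1", "cnn-article-2"],
    "cnbc": ["cnbc-article-1", "cnbc-article-2"],
}

# Buggy version: article_list is re-created on every iteration,
# so only the last source's links survive the loop.
for source, links in feeds.items():
    article_list = []  # overwritten for each source
    for link in links:
        article_list.append(link)
buggy_result = article_list  # only the cnbc links remain

# Fixed version: initialize once, before iterating over the sources.
article_list = []
for source, links in feeds.items():
    for link in links:
        article_list.append(link)

print(buggy_result)   # links from the last source only
print(article_list)   # links from all sources
```

The same idea works with `article_list.extend(links)` instead of the inner append loop; the essential point is that the list is created exactly once, outside the loop over sources.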