
Scraping multiple news article sources into one single list with NewsPaper library in Python?

Dear Stack Overflow community!

This is a follow-up to a previous question I posted here.

I want to extract news article URLs from multiple sources into a single list using the NewsPaper library. This works fine for one source, but as soon as I add a second source link, only the URLs from the second source are extracted.

    import feedparser as fp
    import newspaper
    from newspaper import Article

    website = {"cnn": {"link": "edition.cnn.com", "rss": "rss.cnn.com/rss/cnn_topstories.rss"}, "cnbc": {"link": "cnbc.com", "rss": "cnbc.com/id/10000664/device/rss/rss.html"}}

    for source, value in website.items():
        if 'rss' in value:
            d = fp.parse(value['rss']) 
            #if there is an RSS value for a company, it will be extracted into d
            article_list = []

            for entry in d.entries:
                if hasattr(entry, 'published'):
                    article = {}
                    article['link'] = entry.link
                    article_list.append(article['link'])
                    print(article['link'])

The output is as follows, with only the links from the second source appended:

    ['https://www.cnbc.com/2019/10/23/why-china-isnt-cutting-lending-rates-like-the-rest-of-the-world.html', 'https://www.cnbc.com/2019/10/22/stocks-making-the-biggest-moves-after-hours-snap-texas-instruments-chipotle-and-more.html', ...]

I would like all URLs from both sources to be extracted into the list. Does anyone know a solution to this problem? Thank you very much in advance!!

`article_list` is being overwritten in your first `for` loop. Every time you iterate over a new source, `article_list` is set to a fresh empty list, effectively discarding all information from the previous sources. That is why you end up with links from only one source: the last one.

You should initialize `article_list` once at the beginning instead of overwriting it:

import feedparser as fp
import newspaper
from newspaper import Article

website = {"cnn": {"link": "edition.cnn.com", "rss": "rss.cnn.com/rss/cnn_topstories.rss"}, "cnbc": {"link": "cnbc.com", "rss": "cnbc.com/id/10000664/device/rss/rss.html"}}

article_list = [] # INIT ONCE
for source, value in website.items():
    if 'rss' in value:
        d = fp.parse(value['rss']) 
        #if there is an RSS value for a company, it will be extracted into d
        # article_list = [] THIS IS WHERE IT WAS BEING OVERWRITTEN

        for entry in d.entries:
            if hasattr(entry, 'published'):
                article = {}
                article['link'] = entry.link
                article_list.append(article['link'])
                print(article['link'])
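
The difference can be demonstrated without any network access or the feedparser/newspaper libraries. The sketch below uses a made-up `sources` dict (hypothetical link strings, not real feed data) to stand in for the parsed RSS entries, and contrasts the buggy reset-inside-the-loop pattern with initializing the accumulator once:

```python
# Toy data standing in for the parsed RSS feeds (hypothetical links).
sources = {
    "cnn": ["cnn/article-1", "cnn/article-2"],
    "cnbc": ["cnbc/article-1"],
}

# Buggy pattern: the list is re-created for every source,
# so each iteration throws away the previous source's links.
for links in sources.values():
    article_list = []
    for link in links:
        article_list.append(link)
print(article_list)  # only the last source's links remain

# Fixed pattern: initialize once, then append across all sources.
article_list = []
for links in sources.values():
    for link in links:
        article_list.append(link)
print(article_list)  # links from every source
```

Since Python 3.7 dicts preserve insertion order, so in the buggy version the surviving list always belongs to whichever source was added to the dict last.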
