Scraping multiple news article sources into a single list with the NewsPaper library in Python?

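Dear Stack Overflow community!

This is a follow-up to a previous question I posted here.

I want to extract news article URLs from multiple sources into a single list using the NewsPaper library. This works for one source, but as soon as I add a second source link, only the URLs from the second source are extracted.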

    import feedparser as fp
    import newspaper
    from newspaper import Article

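    # URL schemes restored (assumed); feedparser needs a full URL to fetch a remote feed
    website = {"cnn": {"link": "https://edition.cnn.com", "rss": "http://rss.cnn.com/rss/cnn_topstories.rss"}, "cnbc": {"link": "https://cnbc.com", "rss": "https://cnbc.com/id/10000664/device/rss/rss.html"}}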

    for source, value in website.items():
        if 'rss' in value:
            d = fp.parse(value['rss']) 
            #if there is an RSS value for a company, it will be extracted into d
            article_list = []

            for entry in d.entries:
                if hasattr(entry, 'published'):
                    article = {}
                    article['link'] = entry.link
                    article_list.append(article['link'])
                    print(article['link'])

The output is as follows; only the links from the second source were appended:

    ['https://www.cnbc.com/2019/10/23/why-china-isnt-cutting-lending-rates-like-the-rest-of-the-world.html', 'https://www.cnbc.com/2019/10/22/stocks-making-the-biggest-moves-after-hours-snap-texas-instruments-chipotle-and-more.html', ...]

I would like the URLs from both sources to end up in the list. Does anyone know a solution to this problem? Many thanks in advance!

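article_list is being overwritten inside your outer for loop. Each time the loop moves on to a new source, article_list is set to a new empty list, which discards everything collected from the previous source. That is why you end up with links from only one source, the last one.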
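The effect can be reproduced without any network access. The following is a minimal sketch using stand-in data: the feeds dictionary, its example.com-style URLs, and the name collected are hypothetical and only mimic the shape of the loop in the question.

```python
# Stand-in data: two hypothetical sources, each with its own list of links.
feeds = {
    "source_a": ["https://a.example/1", "https://a.example/2"],
    "source_b": ["https://b.example/1"],
}

for name, links in feeds.items():
    collected = []  # reset on every pass over a source, as in the question
    for link in links:
        collected.append(link)

# Only the links of the last source survive the loop.
print(collected)  # ['https://b.example/1']
```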

You should initialize article_list once, before the loop, instead of resetting it on every iteration.

    import feedparser as fp
    import newspaper
    from newspaper import Article

    # URL schemes restored (assumed); feedparser needs a full URL to fetch a remote feed
    website = {"cnn": {"link": "https://edition.cnn.com", "rss": "http://rss.cnn.com/rss/cnn_topstories.rss"}, "cnbc": {"link": "https://cnbc.com", "rss": "https://cnbc.com/id/10000664/device/rss/rss.html"}}

    article_list = []  # initialize once, before looping over the sources
    for source, value in website.items():
        if 'rss' in value:
            # if the source has an RSS value, its feed is parsed into d
            d = fp.parse(value['rss'])
            # article_list = []  <- this is where it was being overwritten

            for entry in d.entries:
                if hasattr(entry, 'published'):
                    article = {}
                    article['link'] = entry.link
                    article_list.append(article['link'])
                    print(article['link'])
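If the same logic is needed in more than one place, it could be wrapped in a function so the list is created exactly once per call. This is only a sketch, not part of the original answer: the function name collect_links and its parse parameter are made up for illustration. The parse parameter lets a stand-in parser be passed in for testing; when it is omitted, the sketch falls back to feedparser.parse.

```python
def collect_links(feeds, parse=None):
    """Return the links of all dated entries across every feed in `feeds`.

    `feeds` has the same shape as the `website` dict above.
    `parse` is a hypothetical hook; it defaults to feedparser.parse.
    """
    if parse is None:
        import feedparser  # imported lazily so a stand-in parser can be used
        parse = feedparser.parse

    links = []  # created once, before the loop over sources
    for name, info in feeds.items():
        rss_url = info.get("rss")
        if not rss_url:
            continue  # skip sources that have no RSS feed
        for entry in parse(rss_url).entries:
            # keep only entries that carry a publication date
            if hasattr(entry, "published"):
                links.append(entry.link)
    return links
```

Calling collect_links(website) would then return one list containing the links from both CNN and CNBC, assuming both feeds are reachable.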

