Scraping multiple news article sources into one single list with NewsPaper library in Python?
Dear Stackoverflow community!
This is a follow-up question regarding a previous question I posted here.
I would like to extract newspaper URLs with the NewsPaper library from MULTIPLE sources into one SINGLE list. This worked well for one source, but as soon as I add a second source link, it extracts only the URLs of the second one.
import feedparser as fp
import newspaper
from newspaper import Article

website = {"cnn": {"link": "edition.cnn.com", "rss": "rss.cnn.com/rss/cnn_topstories.rss"},
           "cnbc": {"link": "cnbc.com", "rss": "cnbc.com/id/10000664/device/rss/rss.html"}}

for source, value in website.items():
    if 'rss' in value:
        # if there is an RSS value for a company, it will be parsed into d
        d = fp.parse(value['rss'])
        article_list = []
        for entry in d.entries:
            if hasattr(entry, 'published'):
                article = {}
                article['link'] = entry.link
                article_list.append(article['link'])
                print(article['link'])
The output is as follows; only the links from the second source are appended:
['https://www.cnbc.com/2019/10/23/why-china-isnt-cutting-lending-rates-like-the-rest-of-the-world.html', 'https://www.cnbc.com/2019/10/22/stocks-making-the-biggest-moves-after-hours-snap-texas-instruments-chipotle-and-more.html', ...]
I would like all the URLs from both sources to be extracted into the list. Does anyone know a solution to this problem? Thank you very much in advance!
article_list is being overwritten in your first for loop. Each time you iterate over a new source, article_list is set to a new empty list, effectively losing all information from the previous source. That's why at the end you only have information from one source, the last one.
You should initialize article_list at the beginning and not overwrite it.
import feedparser as fp
import newspaper
from newspaper import Article

website = {"cnn": {"link": "edition.cnn.com", "rss": "rss.cnn.com/rss/cnn_topstories.rss"},
           "cnbc": {"link": "cnbc.com", "rss": "cnbc.com/id/10000664/device/rss/rss.html"}}

article_list = []  # INIT ONCE
for source, value in website.items():
    if 'rss' in value:
        # if there is an RSS value for a company, it will be parsed into d
        d = fp.parse(value['rss'])
        # article_list = []  <-- THIS IS WHERE IT WAS BEING OVERWRITTEN
        for entry in d.entries:
            if hasattr(entry, 'published'):
                article = {}
                article['link'] = entry.link
                article_list.append(article['link'])
                print(article['link'])
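The effect of the fix can be demonstrated without any network access or RSS parsing. The sketch below uses a hard-coded dict of fake entry links as a stand-in for the parsed feeds, and contrasts re-initializing the accumulator inside the loop (the bug) with initializing it once before the loop:

```python
# Stand-in for the entries returned by fp.parse(value['rss']) per source.
feeds = {
    "cnn": ["cnn-article-1", "cnn-article-2"],
    "cnbc": ["cnbc-article-1", "cnbc-article-2"],
}

# Buggy version: article_list is re-created on every iteration,
# so only the last source's links survive the loop.
for source, links in feeds.items():
    article_list = []  # overwritten for each source
    for link in links:
        article_list.append(link)
buggy_result = article_list  # only the cnbc links remain

# Fixed version: initialize once, before iterating over the sources.
article_list = []
for source, links in feeds.items():
    for link in links:
        article_list.append(link)

print(buggy_result)   # links from the last source only
print(article_list)   # links from all sources
```

The same idea works with `article_list.extend(links)` instead of the inner append loop; the essential point is that the list is created exactly once, outside the loop over sources.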