Getting more article URLs from a news source with newspaper3k?

Question

When I run

import newspaper
paper = newspaper.build('http://cnn.com', memoize_articles=False)
print(len(paper.articles))

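I see that newspaper found 902 articles on http://cnn.com. That seems very low to me, considering that they publish many articles every day and have been publishing online for many years. Are these really all of the articles on http://cnn.com? If not, is there any way to find the URLs of the remaining articles?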

Answer 1

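Newspaper only queries the items on CNN's home page, so the module does not query every category on the domain (e.g. business, health, etc.). Based on my code, as of today only 698 unique articles were found by Newspaper. Some of these articles may still be the same, because some URLs have hashes but appear to point to the same article.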

P.S. You can query all of the categories, but that requires Selenium together with Newspaper.

from newspaper import build

articles = []
urls_set = set()
cnn_articles = build('http://cnn.com', memoize_articles=False)
for article in cnn_articles.articles:
    # check to see if the article url is not within the urls_set
    if article.url not in urls_set:
        # add the unique article url to the set
        urls_set.add(article.url)
        articles.append(article.url)


print(len(articles))
# 698
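
If the "hashes" mentioned above are URL fragments (the part after `#`), which is an assumption on my part, the near-duplicates could be collapsed by stripping the fragment before comparing. Here is a minimal sketch using only the standard library. The helper name `unique_without_fragments` and the sample URLs are made up for illustration and are not part of Newspaper.

```python
from urllib.parse import urldefrag


def unique_without_fragments(urls):
    """Return URLs in first-seen order, treating URLs that differ
    only by their #fragment as the same article."""
    seen = set()
    unique = []
    for url in urls:
        # urldefrag splits "page.html#section" into ("page.html", "section")
        base, _fragment = urldefrag(url)
        if base not in seen:
            seen.add(base)
            unique.append(base)
    return unique


# Hypothetical URLs, only to show the behaviour
sample_urls = [
    "http://cnn.com/2020/10/02/health/example/index.html",
    "http://cnn.com/2020/10/02/health/example/index.html#comments",
    "http://cnn.com/2020/10/02/business/other/index.html",
]
print(len(unique_without_fragments(sample_urls)))
```

With Newspaper, the same helper could be given the list of `article.url` values collected in the loop above. It will not help if the duplicates differ in some other way, such as query parameters.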

Answer 1
1 2020-10-02 21:34:36