Newspaper library

Question

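As an absolute newbie to Python, I have run into some difficulties using the newspaper library. My goal is to use newspaper on a regular basis to download all new articles from the German news website "tagesschau", as well as all articles from CNN, to build up a data stack that I can analyze in a few years. If I got it right, I can use the following commands to download all the articles and scrape them into Python.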

import newspaper
from newspaper import news_pool

tagesschau_paper = newspaper.build('http://tagesschau.de')
cnn_paper = newspaper.build('http://cnn.com')

papers = [tagesschau_paper, cnn_paper]
news_pool.set(papers, threads_per_source=2)  # (2*2) = 4 threads total
news_pool.join()

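If this is the right way to download all the articles, how can I extract and save them outside of Python? Or how can I keep these articles in Python so that I can reuse them after restarting Python?

Thanks for your help.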

Answer 1

The following code saves the downloaded articles in HTML format. In the working folder you will find tagesschau_paper0.html, tagesschau_paper1.html, tagesschau_paper2.html, ...

import newspaper
from newspaper import news_pool

tagesschau_paper = newspaper.build('http://tagesschau.de')
cnn_paper = newspaper.build('http://cnn.com')

papers = [tagesschau_paper, cnn_paper]
news_pool.set(papers, threads_per_source=2)
news_pool.join()

for i in range(tagesschau_paper.size()):
    # Write each article's raw HTML to its own numbered file
    with open("tagesschau_paper{}.html".format(i), "w", encoding="utf-8") as file:
        file.write(tagesschau_paper.articles[i].html)

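Note: news_pool did not get anything from CNN, so I skipped writing the code for it. If you check cnn_paper.size(), the result is 0. You have to import and use Source instead.

The code above can serve as an example for saving articles in other formats, e.g. txt, or for saving only the parts of an article that you need, e.g. authors, body text, publish_date.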
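As a minimal sketch of saving only selected parts, the snippet below appends one JSON object per article to a JSON Lines file and skips URLs that are already stored, which suits periodic runs. The helper names `article_to_record` and `append_new_records` and the chosen field set are assumptions for illustration, not part of newspaper. It expects article-like objects exposing `url`, `title`, `authors`, `publish_date` and `text`; with newspaper, these fields are only filled in after `article.parse()` has been called on a downloaded article.

```python
import json
from pathlib import Path


def article_to_record(article):
    """Pick the fields worth keeping from a parsed article-like object."""
    published = getattr(article, "publish_date", None)
    return {
        "url": article.url,
        "title": article.title,
        "authors": list(article.authors),
        # publish_date may be None when no date could be extracted
        "publish_date": published.isoformat() if published else None,
        "text": article.text,
    }


def append_new_records(articles, path):
    """Append articles to a JSON Lines file, skipping URLs already stored."""
    path = Path(path)
    seen = set()
    if path.exists():
        with path.open(encoding="utf-8") as handle:
            seen = {json.loads(line)["url"] for line in handle if line.strip()}

    added = 0
    with path.open("a", encoding="utf-8") as handle:
        for article in articles:
            if article.url in seen:
                continue
            record = article_to_record(article)
            # ensure_ascii=False keeps German umlauts readable in the file
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            seen.add(article.url)
            added += 1
    return added
```

A call such as `append_new_records(tagesschau_paper.articles, "tagesschau.jsonl")` after parsing each article would then grow the file a little on every run; the file name here is only an example.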

Answer 2

You can use pickle to save objects outside of Python and reopen them later:

import pickle

file_name = "testfile"

# news_pool is the object from the download code in the question.
# Open the file for writing in binary mode and write the
# object news_pool to the file named 'testfile'
with open(file_name, "wb") as file_object:
    pickle.dump(news_pool, file_object)

# Reopen the file for reading, also in binary mode, and load
# the object from the file into news_pool_reopen
with open(file_name, "rb") as file_object:
    news_pool_reopen = pickle.load(file_object)
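Whether the news_pool object itself pickles cleanly is not confirmed here. A more conservative variant, sketched below, pickles plain Python data (for example a list of dicts with the article fields you care about) instead of library objects. The function names `save_records` and `load_records` are hypothetical helpers for illustration.

```python
import pickle


def save_records(records, path):
    """Serialize a list of plain dicts to a binary pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(records, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_records(path):
    """Read the list back; only unpickle files you created yourself."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

After restarting Python, `load_records("articles.pkl")` would return the same list that was passed to `save_records`; the file name is only an example.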

Answer 1
Score 0, accepted, 2018-11-16 08:49:07

Answer 2
Score -1, 2018-11-13 22:41:39
