How to use Newspaper3k library without downloading articles?
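Suppose I have local copies of news articles. How can I run newspaper on them? According to the documentation, normal use of the newspaper library looks like this: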
from newspaper import Article
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.parse()
# ...
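In my case, I don't need to download the article from a web page, because I already have a local copy of the page. How can I use newspaper on a local copy of a web page?
You can, it's just a bit hacky. As an example: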
import requests
from newspaper import Article
url = 'https://www.cnn.com/2019/06/19/india/chennai-water-crisis-intl-hnk/index.html'
# get sample html
r = requests.get(url)
# save to file
with open('file.html', 'wb') as fh:
    fh.write(r.content)
a = Article(url)
# set html manually
with open("file.html", 'rb') as fh:
    a.html = fh.read()
# need to set download_state to 2 for this to work
a.download_state = 2
a.parse()
# Now the article should be populated
a.text
# 'New Delhi (CNN) The floor...'
Here, download_state comes from this snippet of newspaper's article.py:
# /path/to/site-packages/newspaper/article.py
class ArticleDownloadState(object):
    NOT_STARTED = 0
    FAILED_RESPONSE = 1
    SUCCESS = 2
~snip~
# This is why you need to set that variable
class Article:
    def __init__(...):
        ~snip~
        # Keep state for downloads and parsing
        self.is_parsed = False
        self.download_state = ArticleDownloadState.NOT_STARTED
        self.download_exception_msg = None
    def parse(self):
        # will throw exception if download_state isn't 2
        self.throw_if_not_downloaded_verbose()
        self.doc = self.config.get_parser().fromstring(self.html)
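To make the role of that flag concrete, here is a minimal stand-in for the guard. `DownloadState` and `MiniArticle` are hypothetical names used only for illustration and are not part of newspaper; the real class does far more, and only the state check is modeled here.

```python
class DownloadState:
    # Same three values as in the snippet above
    NOT_STARTED = 0
    FAILED_RESPONSE = 1
    SUCCESS = 2


class MiniArticle:
    """Hypothetical, simplified model of the download-state guard."""

    def __init__(self):
        self.html = ""
        self.download_state = DownloadState.NOT_STARTED
        self.is_parsed = False

    def parse(self):
        # Refuse to parse unless the state says the HTML is available
        if self.download_state != DownloadState.SUCCESS:
            raise RuntimeError("call download() or set the HTML first")
        self.is_parsed = True


article = MiniArticle()
article.html = "<html><body><p>local copy</p></body></html>"

try:
    article.parse()  # state is still NOT_STARTED, so this raises
except RuntimeError as exc:
    first_error = str(exc)

article.download_state = DownloadState.SUCCESS
article.parse()  # the guard now lets parsing proceed
```

Assigning the HTML alone is not enough in this model: the parse step only looks at the state flag, which is why the answer sets it by hand.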
As an alternative, you can subclass Article and override parse so that it does the same thing:
from newspaper import Article
import io
class localArticle(Article):
    def __init__(self, url, **kwargs):
        # set url to be file_name in __init__ if it's a file handle
        super().__init__(url if isinstance(url, str) else url.name, **kwargs)
        # set standalone _url attr so that parse will work as expected
        self._url = url
    def parse(self):
        # sets html and things for you
        if isinstance(self._url, str):
            with open(self._url, 'rb') as fh:
                self.html = fh.read()
        elif isinstance(self._url, (io.TextIOWrapper, io.BufferedReader)):
            self.html = self._url.read()
        else:
            raise TypeError(f"Expected file path or file-like object, got {self._url.__class__}")
        self.download_state = 2
        # now parse will continue on with the proper params set
        super(localArticle, self).parse()
a = localArticle('file.html') # pass your file name here
a.parse()
a.text[:10]
# 'New Delhi '
# or you can give it a file handle
with open("file.html", 'rb') as fh:
    a = localArticle(fh)
    a.parse()
a.text[:10]
# 'New Delhi '
There is in fact an official way to solve this problem. After loading the HTML in your program, you can use the set_html() method to set it as article.html:
import newspaper
with open("file.html", 'rb') as fh:
    ht = fh.read()
article = newspaper.Article(url = ' ')
article.set_html(ht)
article.parse()
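If there are many saved pages, the same set_html() and parse() sequence can be wrapped in a small loop. The sketch below is only one possible arrangement: `parse_local_html` and its `make_article` parameter are hypothetical names, and the helper assumes each object returned by `make_article` offers set_html() and parse() as used above.

```python
from pathlib import Path


def parse_local_html(directory, make_article):
    """Parse every *.html file in `directory` without any network access.

    `make_article` is a zero-argument callable returning a fresh object
    with set_html() and parse() methods (for example a newspaper Article).
    """
    parsed = {}
    for path in sorted(Path(directory).glob("*.html")):
        article = make_article()
        # Hand over the raw bytes, as in the answer above
        article.set_html(path.read_bytes())
        article.parse()
        parsed[path.name] = article
    return parsed
```

With newspaper installed, a call might look like parse_local_html('pages', lambda: newspaper.Article(url=' ')), where 'pages' is a placeholder directory name. Passing the factory in keeps the helper itself free of any newspaper import.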
I'm sure you have already solved this, but Newspaper is able to handle locally stored HTML files.
from newspaper import Article
# Downloading the HTML for the article
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.parse()
with open('fox13no.html', 'w') as fileout:
    fileout.write(article.html)
# Read the locally stored HTML with Newspaper
with open("fox13no.html", 'r') as f:
    # note the URL string is empty
    article = Article('', language='en')
    article.download(input_html=f.read())
    article.parse()
    print(article.title)
# New Year, new laws: Obamacare, pot, guns and drones