ArticleException error in web scraping news articles with Python
I am trying to web scrape news articles matching certain keywords. I use Python 3. However, I am not able to get all the articles from the newspaper. After scraping some articles as output in the csv file, I get an ArticleException error. Could anyone help me with this? Ideally, I would like to solve the problem and download all the related articles from the newspaper website. Otherwise, it would also be useful to just skip the URL that shows the error and continue with the next one. Thanks in advance for your help.

This is the code I am using:
import csv, os
import urllib.request

from bs4 import BeautifulSoup
from newspaper import Article

req_keywords = ['coronavirus', 'covid-19']
newspaper_base_url = 'http://www.thedailystar.net'
category = 'country'

def checkif_kw_exist(list_one, list_two):
    common_kw = set(list_one) & set(list_two)
    if len(common_kw) == 0:
        return False, common_kw
    else:
        return True, common_kw

def get_article_info(url):
    a = Article(url)
    a.download()
    a.parse()
    a.nlp()
    success, checked_kws = checkif_kw_exist(req_keywords, a.text.split())
    if success:
        return [url, a.publish_date, a.title, a.text]
    else:
        return False

output_file = "J:/B/output.csv"
if not os.path.exists(output_file):
    open(output_file, 'w').close()

for index in range(1, 50000, 1):
    # NOTE: the line defining page_url was lost from the pasted snippet;
    # this reconstruction is a guess based on the `category` variable above.
    page_url = newspaper_base_url + '/' + category + '?page=' + str(index)
    page_soup = BeautifulSoup(urllib.request.urlopen(page_url).read())
    primary_tag = page_soup.find_all("h4", attrs={"class": "pad-bottom-small"})
    for tag in primary_tag:
        url = tag.find("a")
        #print (url)
        url = newspaper_base_url + url.get('href')
        result = get_article_info(url)
        if result is not False:
            with open(output_file, 'a', encoding='utf-8') as f:
                writeFile = csv.writer(f)
                writeFile.writerow(result)
                # the `with` block closes the file; no explicit close needed
        else:
            pass
This is the error I am getting:
---------------------------------------------------------------------------
ArticleException Traceback (most recent call last)
<ipython-input-1-991b432d3bd0> in <module>
65 #print (url)
66 url = newspaper_base_url + url.get('href')
---> 67 result = get_article_info(url)
68 if result is not False:
69 with open(output_file, 'a', encoding='utf-8') as f:
<ipython-input-1-991b432d3bd0> in get_article_info(url)
28 a = Article(url)
29 a.download()
---> 30 a.parse()
31 a.nlp()
32 success, checked_kws = checkif_kw_exist(req_keywords, a.text.split())
~\Anaconda3\lib\site-packages\newspaper\article.py in parse(self)
189
190 def parse(self):
--> 191 self.throw_if_not_downloaded_verbose()
192
193 self.doc = self.config.get_parser().fromstring(self.html)
~\Anaconda3\lib\site-packages\newspaper\article.py in throw_if_not_downloaded_verbose(self)
530 elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
531 raise ArticleException('Article `download()` failed with %s on URL %s' %
--> 532 (self.download_exception_msg, self.url))
533
534 def throw_if_not_parsed_verbose(self):
ArticleException: Article `download()` failed with HTTPSConnectionPool(host='www.thedailystar.net', port=443): Read timed out. (read timeout=7) on URL http://www.thedailystar.net/ugc-asks-private-universities-stop-admissions-grades-without-test-for-coronavirus-pandemic-1890151
The quickest way to 'skip' failures related to the downloaded content is to use a try/except as follows:
def get_article_info(url):
    a = Article(url)
    try:
        a.download()
        a.parse()
        a.nlp()
        success, checked_kws = checkif_kw_exist(req_keywords, a.text.split())
        if success:
            return [url, a.publish_date, a.title, a.text]
        else:
            return False
    except:
        return False
Using a bare except to catch every possible exception, and ignore it, isn't recommended, and this answer would be downvoted if I didn't suggest that you deal with exceptions a little better. You did also ask about solving the issue. Without reading the documentation for the libraries you import, you won't know what exceptions might occur, so printing out the details of exceptions while you're skipping them will tell you what happened, like the ArticleException you are getting now. And you can start adding individual except sections to deal with the ones you have already encountered:
from newspaper.article import ArticleException  # needed to catch it by name

def get_article_info(url):
    a = Article(url)
    try:
        a.download()
        a.parse()
        a.nlp()
        success, checked_kws = checkif_kw_exist(req_keywords, a.text.split())
        if success:
            return [url, a.publish_date, a.title, a.text]
        else:
            return False
    except ArticleException as ae:
        print(ae)
        return False
    except Exception as e:
        print(e)
        return False
The ArticleException you are getting is telling you that you hit a timeout error, which means the response from the Daily Star hasn't completed within the time limit. Maybe it's very busy :) You could try downloading several times before giving up.