
ArticleException error in web scraping news articles by python

I am trying to web scrape news articles that contain certain keywords, using Python 3. However, I am not able to get all the articles from the newspaper. After scraping some articles as output into the csv file, I get an ArticleException error. Could anyone help me with this? Ideally, I would like to solve the problem and download all the related articles from the newspaper website. Otherwise, it would also be useful to just skip the URL that throws the error and continue with the next one. Thanks in advance for your help.

This is the code I am using:

import urllib.request
import newspaper
from newspaper import Article
import csv, os
from bs4 import BeautifulSoup
import urllib

req_keywords = ['coronavirus', 'covid-19']

newspaper_base_url = 'http://www.thedailystar.net'
category = 'country'

def checkif_kw_exist(list_one, list_two):
    common_kw = set(list_one) & set(list_two)
    if len(common_kw) == 0: return False, common_kw
    else: return True, common_kw

def get_article_info(url):
    a = Article(url)
    a.download()
    a.parse()
    a.nlp()
    success, checked_kws = checkif_kw_exist(req_keywords, a.text.split())
    if success:
        return [url, a.publish_date, a.title, a.text]
    else: return False


output_file = "J:/B/output.csv"
if not os.path.exists(output_file):
    open(output_file, 'w').close() 


for index in range(1,50000,1):

    # page_url is not defined in the snippet as posted; a category listing-page
    # URL built from the base URL, the category and the page index is assumed here:
    page_url = newspaper_base_url + '/' + category + '?page=' + str(index)

    page_soup = BeautifulSoup(urllib.request.urlopen(page_url).read(), 'html.parser')

    primary_tag = page_soup.find_all("h4", attrs={"class": "pad-bottom-small"})

    for tag in primary_tag:

        url = tag.find("a")
        #print (url)
        url = newspaper_base_url + url.get('href')
        result = get_article_info(url)
        if result is not False:
            # the with statement closes the file automatically
            with open(output_file, 'a', encoding='utf-8') as f:
                writeFile = csv.writer(f)
                writeFile.writerow(result)

This is the error I am getting:

---------------------------------------------------------------------------
ArticleException                          Traceback (most recent call last)
<ipython-input-1-991b432d3bd0> in <module>
     65         #print (url)
     66         url = newspaper_base_url + url.get('href')
---> 67         result = get_article_info(url)
     68         if result is not False:
     69             with open(output_file, 'a', encoding='utf-8') as f:

<ipython-input-1-991b432d3bd0> in get_article_info(url)
     28     a = Article(url)
     29     a.download()
---> 30     a.parse()
     31     a.nlp()
     32     success, checked_kws = checkif_kw_exist(req_keywords, a.text.split())

~\Anaconda3\lib\site-packages\newspaper\article.py in parse(self)
    189 
    190     def parse(self):
--> 191         self.throw_if_not_downloaded_verbose()
    192 
    193         self.doc = self.config.get_parser().fromstring(self.html)

~\Anaconda3\lib\site-packages\newspaper\article.py in throw_if_not_downloaded_verbose(self)
    530         elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
    531             raise ArticleException('Article `download()` failed with %s on URL %s' %
--> 532                   (self.download_exception_msg, self.url))
    533 
    534     def throw_if_not_parsed_verbose(self):

ArticleException: Article `download()` failed with HTTPSConnectionPool(host='www.thedailystar.net', port=443): Read timed out. (read timeout=7) on URL http://www.thedailystar.net/ugc-asks-private-universities-stop-admissions-grades-without-test-for-coronavirus-pandemic-1890151

The quickest way to 'skip' failures related to the downloaded content is to use a try/except as follows:

def get_article_info(url):
  a = Article(url)
  try:
    a.download()
    a.parse()
    a.nlp()
    success, checked_kws = checkif_kw_exist(req_keywords, a.text.split())
    if success:
      return [url, a.publish_date, a.title, a.text]
    else: return False
  except:
    return False

Using an except to catch every possible exception and ignore it isn't recommended, and this answer would be downvoted if I didn't suggest that you deal with exceptions a little better. You did also ask about solving the issue. Without reading the documentation for the libraries you import, you won't know what exceptions might occur, so printing out the details of exceptions while you're skipping them will give you that information, like the ArticleException you are getting now. You can then start adding individual except sections to deal with the ones you have already encountered:

# ArticleException lives in the newspaper.article module (see the traceback above)
from newspaper.article import ArticleException

def get_article_info(url):
  a = Article(url)
  try:
    a.download()
    a.parse()
    a.nlp()
    success, checked_kws = checkif_kw_exist(req_keywords, a.text.split())
    if success:
      return [url, a.publish_date, a.title, a.text]
    else:
      return False
  except ArticleException as ae:
    print(ae)
    return False
  except Exception as e:
    print(e)
    return False

The ArticleException you are getting is telling you that you hit a timeout error, which means the response from the Daily Star hasn't completed within the time limit. Maybe it's very busy :) You could try downloading several times before giving up, as in the sketch below.
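
For example, here is a minimal retry sketch. It assumes newspaper3k's request_timeout configuration option (its default of 7 seconds is what appears in your traceback); the helper name download_article_with_retries and the attempt/delay/timeout values are only illustrative:

import time
from newspaper import Article
from newspaper.article import ArticleException

def download_article_with_retries(url, attempts=3, delay=5):
  # Try the download a few times before giving up; request_timeout raises the
  # per-request timeout above the default 7 seconds (15 here is an assumption).
  for attempt in range(1, attempts + 1):
    a = Article(url, request_timeout=15)
    try:
      a.download()
      a.parse()
      return a
    except ArticleException as ae:
      print('Attempt %d of %d failed for %s: %s' % (attempt, attempts, url, ae))
      time.sleep(delay)
  return None

get_article_info() could then call this helper and treat a None result the same way as a failed keyword check, skipping the URL only after every attempt has failed.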


 