简体   繁体   中英

Web scraping articles from Google News

I am trying to web scrape googlenews with the gnews package. However, I don't know how to do web scraping for older articles like, for example, articles from 2010.

from gnews import GNews
from newspaper import Article
import pandas as pd
import datetime

google_news = GNews(language='es', country='Argentina', period = '7d')
argentina_news = google_news.get_news('protesta clarin')
print(len(argentina_news))

this code works perfectly to get recent articles but I need older articles. I saw https://github.com/ranahaani/GNews#todo and something like the following appears:

google_news = GNews(language='es', country='Argentina', period='7d', start_date='01-01-2015', end_date='01-01-2016', max_results=10, exclude_websites=['yahoo.com', 'cnn.com'],
                    proxy=proxy)

but when I try star_date I get:

TypeError: __init__() got an unexpected keyword argument 'start_date'

can anyone help to get articles for specific dates. Thank you very mucha guys!

The example code is incorrect for gnews==0.2.7 which is the latest you can install off PyPI via pip (or whatever). The documentation is for the unreleased mainline code that you can get directly off their git source.

Confirmed by inspecting the GNews::__init__ method, and the method doesn't have keyword args for start_date or end_date :

In [1]: import gnews

In [2]: gnews.GNews.__init__??
Signature:
gnews.GNews.__init__(
    self,
    language='en',
    country='US',
    max_results=100,
    period=None,
    exclude_websites=None,
    proxy=None,
)
Docstring: Initialize self.  See help(type(self)) for accurate signature.
Source:
    def __init__(self, language="en", country="US", max_results=100, period=None, exclude_websites=None, proxy=None):
        self.countries = tuple(AVAILABLE_COUNTRIES),
        self.languages = tuple(AVAILABLE_LANGUAGES),
        self._max_results = max_results
        self._language = language
        self._country = country
        self._period = period
        self._exclude_websites = exclude_websites if exclude_websites and isinstance(exclude_websites, list) else []
        self._proxy = {'http': proxy, 'https': proxy} if proxy else None
File:      ~/src/news-test/.venv/lib/python3.10/site-packages/gnews/gnews.py
Type:      function

If you want the start_date and end_date functionality, that was only added rather recently, so you will need to install the module off their git source.

# use whatever you use to uninstall any pre-existing gnews module
pip uninstall gnews

# install from the project's git main branch
pip install git+https://github.com/ranahaani/GNews.git

Now you can use the start/end functionality:

import datetime

import gnews

start = datetime.date(2015, 1, 15)
end = datetime.date(2015, 1, 16)

google_news = GNews(language='es', country='Argentina', start_date=start, end_date=end)
rsp = google_news.get_news("protesta")
print(rsp)

I get this as a result:

[{'title': 'Latin Roots: The Protest Music Of South America - NPR',
  'description': 'Latin Roots: The Protest Music Of South America  NPR',
  'published date': 'Thu, 15 Jan 2015 08:00:00 GMT',
  'url': 'https://www.npr.org/sections/world-cafe/2015/01/15/377491862/latin-roots-the-protest-music-of-south-america',
  'publisher': {'href': 'https://www.npr.org', 'title': 'NPR'}}]

Also note:

  • period is ignored if you set start_date and end_date
  • Their documentation shows you can pass the dates as tuples like (2015, 1, 15) . This doesn't seem to work - just be safe and pass a datetime object.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM