Scraping Google News with pygooglenews

I am trying to scrape Google News with pygooglenews. Because Google limits each search to 100 articles, I am trying to collect more than 100 by changing the target dates in a for loop. Below is what I have so far, but I keep getting this error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-84-4ada7169ebe7> in <module>
----> 1 df = pd.DataFrame(get_news('Banana'))
      2 writer = pd.ExcelWriter('My Result.xlsx', engine='xlsxwriter')
      3 df.to_excel(writer, sheet_name='Results', index=False)
      4 writer.save()

<ipython-input-79-c5266f97934d> in get_titles(search)
      9 
     10     for date in date_list[:-1]:
---> 11         search = gn.search(search, from_=date, to_=date_list[date_list.index(date)])
     12         newsitem = search['entries']
     13 

~\AppData\Roaming\Python\Python37\site-packages\pygooglenews\__init__.py in search(self, query, helper, when, from_, to_, proxies, scraping_bee)
    140         if from_ and not when:
    141             from_ = self.__from_to_helper(validate=from_)
--> 142             query += ' after:' + from_
    143 
    144         if to_ and not when:

TypeError: unsupported operand type(s) for +=: 'dict' and 'str'

import pandas as pd
from pygooglenews import GoogleNews
import datetime

gn = GoogleNews()

def get_news(search):
    stories = []
    start_date = datetime.date(2021,3,1)
    end_date = datetime.date(2021,3,5)
    delta = datetime.timedelta(days=1)
    date_list = pd.date_range(start_date, end_date).tolist()
    
    for date in date_list[:-1]:
        search = gn.search(search, from_=date.strftime('%Y-%m-%d'), to_=(date+delta).strftime('%Y-%m-%d'))
        newsitem = search['entries']

        for item in newsitem:
            story = {
                'title':item.title,
                'link':item.link,
                'published':item.published
            }
            stories.append(story)

    return stories

df = pd.DataFrame(get_news('Banana'))

Thank you in advance.

It looks like you are correctly passing a string into get_news(), which is then passed on as the first argument (search) to gn.search().

However, you reassign search to the result of gn.search() in the line:

  search = gn.search(search, from_=date.strftime('%Y-%m-%d'), to_=(date+delta).strftime('%Y-%m-%d'))
# ^^^^^^
# gets overwritten with the result of gn.search()

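On the next iteration, this reassigned search (now the return value of the previous call, not a string) is passed into gn.search(), which tries to append ' after:' + from_ to it and fails.

If you look at the code in pygooglenews, it looks like gn.search() returns a dict, which would explain the error: a str cannot be added to a dict with +=.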
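The failure can be reproduced without pygooglenews or a network connection. The sketch below is only an illustration: fake_search is a hypothetical stand-in that mimics the string concatenation shown at line 142 of the traceback and returns a dict, as gn.search() appears to.

```python
# fake_search is a hypothetical stand-in for gn.search(), not the real library code.
# It appends to the query with +=, like line 142 of the traceback, and returns a dict.
def fake_search(query, from_=None):
    if from_:
        query += ' after:' + from_
    return {'entries': [], 'query': query}


search = 'Banana'

# First call: search is still a str, so the concatenation works.
search = fake_search(search, from_='2021-03-01')

# Second call: search now holds the dict returned above, so += with a str fails.
try:
    search = fake_search(search, from_='2021-03-02')
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

The first call succeeds and the second raises the same kind of TypeError as in the question, which is why the error only appears from the second date onward.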

To fix this, simply use a different variable, e.g.:

  result = gn.search(search, from_=date.strftime('%Y-%m-%d'), to_=(date+delta).strftime('%Y-%m-%d'))
  newsitem = result['entries']
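Putting the rename into the whole function might look like the sketch below. It keeps the query and the result in separate names (query and result), so the query stays a string on every iteration. To make it runnable offline, the search function is passed in as a parameter; search_fn, start_date, end_date and _stub_search are names introduced here for illustration and are not part of the original code or of pygooglenews.

```python
import datetime
import types


def get_news(query, search_fn, start_date, end_date):
    """Collect stories one day at a time.

    search_fn(query, from_=..., to_=...) is assumed to return a dict
    with an 'entries' list, as gn.search() does in the question.
    """
    stories = []
    delta = datetime.timedelta(days=1)
    day = start_date
    while day < end_date:
        # Store the response in `result`; `query` is never overwritten.
        result = search_fn(
            query,
            from_=day.strftime('%Y-%m-%d'),
            to_=(day + delta).strftime('%Y-%m-%d'),
        )
        for item in result['entries']:
            stories.append({
                'title': item.title,
                'link': item.link,
                'published': item.published,
            })
        day += delta
    return stories


# Hypothetical offline stub that returns one fake entry per call,
# so the loop can be exercised without hitting Google News.
def _stub_search(query, from_=None, to_=None):
    entry = types.SimpleNamespace(
        title=f'{query} {from_}',
        link=f'https://example.com/{from_}',
        published=from_,
    )
    return {'entries': [entry]}


stories = get_news(
    'Banana',
    _stub_search,
    datetime.date(2021, 3, 1),
    datetime.date(2021, 3, 5),
)
print(len(stories))
```

With the real library, the same function would presumably be called as get_news('Banana', gn.search, start_date, end_date), where gn = GoogleNews() as in the question.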

I know that pygooglenews has a limit of 100 articles per search, so you need a loop that scrapes each day separately.

