
Scraping Google News results with Python and Beautiful Soup retrieves only the first page without headlines

I want to scrape headlines and paragraph text from the Google News search results for a given search term, and I want to do that for the first n pages.

I have written a piece of code that scrapes only the first page, but I do not know how to modify my URL to reach the other pages (page 2, 3, ...). That is my first problem.

The second problem is that I do not know how to scrape the headlines. Every approach I have tried returns an empty list. (I do not think the page is dynamic.)

On the other hand, scraping the paragraph text below each headline works perfectly. Can you tell me how to fix these two problems?

This is my code:

from bs4 import BeautifulSoup
import requests

term = 'cocacola'

# this is only for page 1, how to go to page 2?
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# I think this does not depend on JavaScript; the page is not dynamic
headline_results = soup.find_all('a', class_="l lLrAF")
# headline_results = soup.find_all('h3', class_="r dO0Ag")  # also does not work
print(headline_results)  # empty list, I don't know why

paragraph_results = soup.find_all('div', class_='st')
print(paragraph_results) # works

Problem One: Flipping the page.

To move to the next page, add the start parameter to your URL format string. It is the zero-based offset of the first result to show, so with 10 results per page, page 2 starts at offset 10:

term = 'cocacola'
page = 2
url = 'https://www.google.com/search?q={}&source=lnms&tbm=nws&start={}'.format(
    term, (page - 1) * 10
)
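The same idea can be wrapped in a small helper. This is a minimal sketch, assuming Google still serves 10 news results per page and still honours the start offset. The function name news_page_urls is hypothetical. urlencode is used so that search terms containing spaces or other special characters are escaped properly.

```python
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10  # assumption: Google News shows 10 results per page


def news_page_urls(term, n_pages):
    """Return the search URLs for pages 1..n_pages of a Google News query."""
    urls = []
    for page in range(1, n_pages + 1):
        query = urlencode({
            "q": term,
            "source": "lnms",
            "tbm": "nws",
            "start": (page - 1) * RESULTS_PER_PAGE,  # offset of the first result
        })
        urls.append("https://www.google.com/search?" + query)
    return urls


for url in news_page_urls("coca cola", 3):
    print(url)
```

Each URL can then be passed to requests.get in turn, exactly as in the question's code.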

Problem Two: Scraping the headlines.

Google regenerates the class names, ids, and other attributes of DOM elements, so selectors that hard-code them, such as class_="l lLrAF", are likely to fail whenever you retrieve fresh, uncached results.
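One way to reduce the dependence on generated class names is to select elements by tag and nesting instead. The following is only a sketch: it assumes each headline is a link (an a tag with an href) nested inside an h3 tag, as the selectors in the question suggest, which may not match the markup Google actually returns. SAMPLE_HTML and extract_headlines are hypothetical and exist only to illustrate the approach.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the tags used in the question; the real
# page may be structured differently.
SAMPLE_HTML = """
<div class="g">
  <h3 class="r dO0Ag"><a class="l lLrAF" href="https://example.com/1">First headline</a></h3>
  <div class="st">First snippet</div>
</div>
<div class="g">
  <h3 class="r xYz12"><a class="l aBc34" href="https://example.com/2">Second headline</a></h3>
  <div class="st">Second snippet</div>
</div>
"""


def extract_headlines(html):
    """Return (title, link) pairs for every link nested inside an h3 tag."""
    soup = BeautifulSoup(html, "html.parser")
    headlines = []
    for heading in soup.find_all("h3"):
        link = heading.find("a", href=True)  # skip headings without a link
        if link is not None:
            headlines.append((link.get_text(strip=True), link["href"]))
    return headlines


print(extract_headlines(SAMPLE_HTML))
```

If this returns an empty list on the live page, print response.text and check which tags actually surround the headlines, since the HTML served to requests can differ from what the browser's inspector shows.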

Just add the parameter 'start=10' to the search URL, for example: https://www.google.com/search?q=beatifulsoup&ie=utf-8&oe=utf-8&aq=t&start=10

To loop over several result pages, use something like this:

from bs4 import BeautifulSoup
from requests import get

term = "beautifulsoup"
page_max = 5

# loop over pages
for page in range(0, page_max):
    url = "https://www.google.com/search?q={}&ie=utf-8&oe=utf-8&aq=t&start={}".format(term, 10*page)

    r = get(url) # you can also add headers here
    html_soup = BeautifulSoup(r.text, 'html.parser')

Link to partly identical question I answered before.


Alternatively, you can use Google News Result API from SerpApi. It's a paid API with a free trial.

Part of JSON output:

"news_results": [
  {
    "position": 1,
    "link": "https://www.stltoday.com/lifestyles/food-and-cooking/best-bites-pepperidge-farms-caramel-macchiato-flavored-milano-cookies/article_d43e59a0-b362-5cb0-bdef-6b7563d9fed3.html",
    "title": "Best Bites: Pepperidge Farms Caramel Macchiato flavored Milano cookies",
    "source": "St. Louis Post-Dispatch",
    "date": "1 week ago",
    "snippet": "Coffee-flavored food items are usually very hit or miss. But we have found \nthe cookie that has accomplished the absolute best coffee flavoring I ...",
    "thumbnail": "https://serpapi.com/searches/608ffbbcef7ddabfb2982432/images/45d252f31c08b743573f629544c119f07e8c422143bff0265f31c8c08086393a.jpeg"
  }
]

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "best cookies",
  "tbm": "nws",
  "start": "10",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for news_result in results["news_results"]:
  print(f"Title: {news_result['title']}\n")

Output:

Title: 10 Of The Absolute Best Cookies In Sydney
    
Title: This Cookie Quiz Will Reveal Your Best And Worst Quality

Title: Family cookies by Taimur Ali Khan is the best thing on internet

Title: Gibson Dunn Ranked Among Top Three Firms for Client ...

Title: Livingston CARES: Saying thank you to one cookie at a time

Title: Google's plan to replace cookies is the web's best hope for a more private internet

Title: The 12 Best Cookies in NYC

Title: 18 Places to Find the Best Cookies in the Champaign-Urbana ...

Title: Best Cookie Delivery Services - Where to Order Cookies Online

Title: How to make the best cookies for the holidays
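
If a query returns no news, the response may not contain a news_results key at all, in which case the loop above would raise a KeyError. A defensive variant is sketched here on a plain dictionary shaped like the JSON excerpt above. The helper name news_titles and the sample dictionary are hypothetical.

```python
def news_titles(results):
    """Return the titles in a SerpApi-style result dict, or [] if there are none."""
    return [
        item["title"]
        for item in results.get("news_results", [])  # tolerate a missing key
        if "title" in item
    ]


# Sample shaped like the JSON excerpt, trimmed to the fields used here.
sample = {
    "news_results": [
        {
            "position": 1,
            "title": "Best Bites: Pepperidge Farms Caramel Macchiato flavored Milano cookies",
        }
    ]
}

print(news_titles(sample))
print(news_titles({}))
```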

Disclaimer: I work for SerpApi.
