I want to scrape headlines and paragraph texts from the Google News search page for a given search term, for the first n pages.
I have written a piece of code that scrapes the first page only, but I do not know how to modify my URL
so that I can go to the other pages (page 2, 3, ...). That's the first problem I have.
The second problem is that I do not know how to scrape the headlines. It always returns an empty list. I have tried multiple solutions, but it always returns an empty list. (I do not think the page is dynamic.)
On the other hand, scraping the paragraph text below the headline works perfectly. Can you tell me how to fix these two problems?
This is my code:
from bs4 import BeautifulSoup
import requests
term = 'cocacola'
# this is only for page 1, how to go to page 2?
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# I think this is not JavaScript-sensitive; the page is not dynamic
headline_results = soup.find_all('a', class_="l lLrAF")
#headline_results = soup.find_all('h3', class_="r dO0Ag") # also does not work
print(headline_results) #empty list, IDK why?
paragraph_results = soup.find_all('div', class_='st')
print(paragraph_results) # works
Problem One: Flipping the page.
In order to move to the next page, you need to include the start
parameter in your URL's formatted string:
term = 'cocacola'
page = 2
url = 'https://www.google.com/search?q={}&source=lnms&tbm=nws&start={}'.format(
term, (page - 1) * 10
)
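As a quick sanity check of that formula (assuming Google's usual 10 results per page, so page n begins at offset (n - 1) * 10), here is what it produces for the first few pages:

```python
# Sanity check of the start offsets (assumes 10 results per page).
term = 'cocacola'
urls = []
for page in range(1, 4):
    url = 'https://www.google.com/search?q={}&source=lnms&tbm=nws&start={}'.format(
        term, (page - 1) * 10
    )
    urls.append(url)

print(urls[0])  # ...&start=0  (page 1)
print(urls[1])  # ...&start=10 (page 2)
print(urls[2])  # ...&start=20 (page 3)
```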
Problem Two: Scraping the headlines.
Google regenerates the class names, ids, etc. of DOM elements, so matching on a hard-coded class such as "l lLrAF" is likely to fail every time you retrieve new, uncached results.
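Because those generated class names cannot be relied on, one option is to match on the stable tag structure instead: headlines typically sit in h3 elements, snippets in div elements with class st. A minimal sketch against a static snippet (the h3/a class names below are invented for illustration, since the real ones vary per session):

```python
from bs4 import BeautifulSoup

# Static snippet mimicking the result markup; the "xYz12" class is
# invented here, because Google regenerates such names.
html = '''
<div class="g">
  <h3 class="xYz12"><a href="https://example.com/story">Coca-Cola launches new drink</a></h3>
  <div class="st">The beverage giant announced a new product line...</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Match on the tags themselves, not on the volatile class names.
headlines = [h3.get_text() for h3 in soup.find_all('h3')]
snippets = [div.get_text() for div in soup.find_all('div', class_='st')]
print(headlines)  # ['Coca-Cola launches new drink']
```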
To page through results, just add the parameter start=10 to the search URL, like: https://www.google.com/search?q=beatifulsoup&ie=utf-8&oe=utf-8&aq=t&start=10
For dynamic behavior/looping over result pages, use something like this:
from bs4 import BeautifulSoup
from requests import get

term = "beautifulsoup"
page_max = 5

# loop over pages
for page in range(0, page_max):
    url = "https://www.google.com/search?q={}&ie=utf-8&oe=utf-8&aq=t&start={}".format(term, 10 * page)
    r = get(url)  # you can also pass headers here
    html_soup = BeautifulSoup(r.text, 'html.parser')
    # ...extract what you need from html_soup here...
Alternatively, you can use the Google News Result API from SerpApi. It's a paid API with a free trial.
Part of JSON output:
"news_results": [
  {
    "position": 1,
    "link": "https://www.stltoday.com/lifestyles/food-and-cooking/best-bites-pepperidge-farms-caramel-macchiato-flavored-milano-cookies/article_d43e59a0-b362-5cb0-bdef-6b7563d9fed3.html",
    "title": "Best Bites: Pepperidge Farms Caramel Macchiato flavored Milano cookies",
    "source": "St. Louis Post-Dispatch",
    "date": "1 week ago",
    "snippet": "Coffee-flavored food items are usually very hit or miss. But we have found \nthe cookie that has accomplished the absolute best coffee flavoring I ...",
    "thumbnail": "https://serpapi.com/searches/608ffbbcef7ddabfb2982432/images/45d252f31c08b743573f629544c119f07e8c422143bff0265f31c8c08086393a.jpeg"
  }
]
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "best cookies",
    "tbm": "nws",
    "start": "10",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\n")
Output:
Title: 10 Of The Absolute Best Cookies In Sydney
Title: This Cookie Quiz Will Reveal Your Best And Worst Quality
Title: Family cookies by Taimur Ali Khan is the best thing on internet
Title: Gibson Dunn Ranked Among Top Three Firms for Client ...
Title: Livingston CARES: Saying thank you to one cookie at a time
Title: Google's plan to replace cookies is the web's best hope for a more private internet
Title: The 12 Best Cookies in NYC
Title: 18 Places to Find the Best Cookies in the Champaign-Urbana ...
Title: Best Cookie Delivery Services - Where to Order Cookies Online
Title: How to make the best cookies for the holidays
Disclaimer, I work for SerpApi.