In the future (maybe still far away, since I'm still a novice) I want to do data analysis based on the content of the news I get from the Google News RSS feed, but for that I need access to that content, and that is my problem.
Using the URL https://news.google.cl/news/rss I have access to data such as the title and the URL of each news item, but the URL is in a format that does not let me scrape it (https://news.google.com/__i/rss/rd/articles/CBMilgFod...).
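One workaround worth knowing (not part of the original question): the long token in those /rss/rd/articles/CBMi... links is URL-safe base64, and its payload embeds the publisher's URL as plain ASCII. A best-effort decoder can therefore recover the real article URL without any network request. This is an undocumented format and may change at any time:

```python
import base64

def decode_google_news_url(link):
    # Best-effort: the last path segment of a Google News RSS link is
    # URL-safe base64 whose payload contains the publisher URL in ASCII.
    # Undocumented format -- may break if Google changes the encoding.
    token = link.split("/")[-1].split("?")[0]
    decoded = base64.urlsafe_b64decode(token + "=" * (-len(token) % 4))
    start = decoded.find(b"http")
    if start == -1:
        return None
    end = start
    # Read printable ASCII until the binary trailer begins.
    while end < len(decoded) and 0x20 <= decoded[end] < 0x7F:
        end += 1
    return decoded[start:end].decode("ascii")

# Example link taken from one of the answers further down this page:
link = ("https://news.google.com/__i/rss/rd/articles/"
        "CBMiRGh0dHBzOi8vd3d3Lm55dGltZXMuY29tL2xpdmUvMjAyMi8wNy8xOS"
        "93b3JsZC91ay1ldXJvcGUtaGVhdC13ZWF0aGVy0gEA?oc=5")
print(decode_google_news_url(link))
# -> https://www.nytimes.com/live/2022/07/19/world/uk-europe-heat-weather
```

Once you have the publisher URL, you can fetch and parse the article page directly instead of going through the Google News redirect.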
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

news_url = "https://news.google.cl/news/rss"
client = urlopen(news_url)
xml_page = client.read()
client.close()

soup_page = soup(xml_page, "xml")
news_list = soup_page.findAll("item")

for news in news_list:
    print(news.title.text)
    print("-" * 60)
    response = urlopen(news.link.text)
    html = response.read()
    page = soup(html, "html.parser")  # don't reuse the name `soup` here
    text = page.get_text(strip=True)
    print(text)
The last print(text) prints JavaScript code like:
if(typeof bbclAM === 'undefined' || !bbclAM.isAM()) {
googletag.display('div-gpt-ad-1418416256666-0');
} else {
document.getElementById('div-gpt-ad-1418416256666-0').st
yle.display = 'none'
}
});(function(s, p, d) {
var h=d.location.protocol, i=p+"-"+s,
e=d.getElementById(i), r=d.getElementById(p+"-root"),
u=h==="https:"?"d1z2jf7jlzjs58.cloudfront.net"
:"static."+p+".com";
if (e) return;
I expect to print the title and the content of each news item from the RSS feed.
Clone this project:
git clone git@github.com:philipperemy/google-news-scraper.git gns
cd gns
sudo pip install -r requirements.txt
python main_no_vpn.py
The output will be:
{
    "content": "............",
    "datetime": "...",
    "keyword": "...",
    "link": "...",
    "title": "..."
},
{
    "content": "............",
    "datetime": "...",
    "keyword": "...",
    "link": "...",
    "title": "..."
}
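Since the end goal is data analysis, here is one way such JSON output could be consumed. The field names below match the sample above, but the record values are invented placeholders for illustration only:

```python
import json
from collections import Counter

# Records mirroring the scraper's output shape shown above;
# the values are made-up placeholders, not real scraper output.
raw = """[
  {"content": "text one", "datetime": "2019-06-30", "keyword": "salmon",
   "link": "https://example.com/a", "title": "Article A"},
  {"content": "text two", "datetime": "2019-07-01", "keyword": "salmon",
   "link": "https://example.com/b", "title": "Article B"}
]"""

articles = json.loads(raw)

# A first, trivial aggregation: how many articles per keyword.
per_keyword = Counter(a["keyword"] for a in articles)
print(per_keyword.most_common())  # -> [('salmon', 2)]
```

In practice you would read the scraper's saved output with json.load() instead of an inline string, then feed the records into whatever analysis library you prefer.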
Source: the google-news-scraper repository linked above.
This script can give you something to start with (it prints the title, URL, short description, and content from each site). The content parsing is deliberately basic, since every site has a different format and styling:
import textwrap
import requests
from bs4 import BeautifulSoup

news_url = "https://news.google.cl/news/rss"
rss_text = requests.get(news_url).text
soup_page = BeautifulSoup(rss_text, "xml")

def get_items(soup):
    for news in soup.findAll("item"):
        s = BeautifulSoup(news.description.text, 'lxml')
        a = s.select('a')[-1]
        a.extract()  # extract last 'See more on Google News...' link
        html = requests.get(news.link.text)
        soup_content = BeautifulSoup(html.text, "lxml")
        # perform basic sanitization:
        for t in soup_content.select('script, noscript, style, iframe, nav, footer, header'):
            t.extract()
        yield news.title.text.strip(), html.url, s.text.strip(), str(soup_content.select_one('body').text)

width = 80
for (title, url, shorttxt, content) in get_items(soup_page):
    title = '\n'.join(textwrap.wrap(title, width))
    url = '\n'.join(textwrap.wrap(url, width))
    shorttxt = '\n'.join(textwrap.wrap(shorttxt, width))
    content = '\n'.join(textwrap.wrap(textwrap.shorten(content, 1024), width))
    print(title)
    print(url)
    print('-' * width)
    print(shorttxt)
    print()
    print(content)
    print()
Prints:
WWF califica como inaceptable y condenable adulteración de información sobre
salmones de Nova Austral - El Mostrador
https://m.elmostrador.cl/dia/2019/06/30/wwf-califica-como-inaceptable-y-
condenable-adulteracion-de-informacion-sobre-salmones-de-nova-austral/
--------------------------------------------------------------------------------
El MostradorLa organización pide investigar los centros de cultivo de la
salmonera de capitales noruegos y abrirá un proceso formal de quejas. La empresa
ubicada en la ...
01:41:28 WWF califica como inaceptable y condenable adulteración de información
sobre salmones de Nova Austral - El Mostrador País PAÍS WWF califica como
inaceptable y condenable adulteración de información sobre salmones de Nova
Austral por El Mostrador 30 junio, 2019 La organización pide investigar los
centros de cultivo de la salmonera de capitales noruegos y abrirá un proceso
formal de quejas. La empresa ubicada en la Patagonia chilena es acusada de
falsear información oficial ante Sernapesca. 01:41:28 Compartir esta Noticia
Enviar por mail Rectificar Tras una investigación periodística de varios meses,
El Mostrador accedió a abundante información reservada, que incluye correos
electrónicos de la gerencia de producción de la compañía salmonera Nova Austral
–de capitales noruegos– a sus jefes de área, donde se instruye manipular las
estadísticas de mortalidad de los salmones para ocultar las verdaderas cifras a
Sernapesca –la entidad fiscalizadora–, a fin de evitar multas y ver disminuir
las [...]
...and so on.
In order to access data such as the title, you first need to collect all the news items into a list. Each news item is located in an item tag, and the item tags sit inside the channel tag. So let's use this selector:
soup.channel.find_all('item')
After that, you can extract the necessary data from each news item.
for result in soup.channel.find_all('item'):
    title = result.title.text
    link = result.link.text
    date = result.pubDate.text
    source = result.source.get("url")
    print(title, link, date, source, sep='\n', end='\n\n')
Also, make sure you send a user-agent request header so the request looks like a visit from a "real" user. The default requests user-agent is python-requests, and websites can tell that such a request most likely comes from a script. You can check what your own browser's user-agent is online.
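You can see the default user-agent directly from the library (requests.utils.default_user_agent() is a real helper in requests), without touching the network:

```python
import requests

# requests exposes the User-Agent header it sends by default; it starts
# with "python-requests/", which is how sites spot scripted traffic.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # e.g. "python-requests/2.28.1" (version varies)

# Overriding it is just a matter of passing a headers dict, e.g.:
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
# requests.get(url, headers=headers) will then send the browser string.
```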
Code and full example:
from bs4 import BeautifulSoup
import requests

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "hl": "en-US",   # language
    "gl": "US",      # country of the search, US -> USA
    "ceid": "US:en",
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}

html = requests.get("https://news.google.com/rss", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "xml")

for result in soup.channel.find_all('item'):
    title = result.title.text
    link = result.link.text
    date = result.pubDate.text
    source = result.source.get("url")
    print(title, link, date, source, sep='\n', end='\n\n')
Output:
UK and Europe Heat Wave News: Live Updates - The New York Times
https://news.google.com/__i/rss/rd/articles/CBMiRGh0dHBzOi8vd3d3Lm55dGltZXMuY29tL2xpdmUvMjAyMi8wNy8xOS93b3JsZC91ay1ldXJvcGUtaGVhdC13ZWF0aGVy0gEA?oc=5
Tue, 19 Jul 2022 11:56:58 GMT
https://www.nytimes.com
... other results
Another way to achieve the same thing is to scrape Google News from the HTML search results instead.
I want to demonstrate how to scrape Google News using pagination. One way is to use the start URL parameter, which is 0 by default: 0 means the first page, 10 the second, and so on.
Also, a default search returns only about 10-15 pages of results. To increase the number of returned pages, set the filter parameter to 0 and pass it in the URL, which will return 10+ pages. Basically, this parameter controls the filters for Similar Results and Omitted Results.
While the next-page button exists, you increment the "start" parameter value by 10 to access the next page; once it is no longer present, you break out of the while loop.
And here is the code:
from bs4 import BeautifulSoup
import requests, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "Elon Musk",
    "hl": "en-US",  # language
    "gl": "US",     # country of the search, US -> USA
    "tbm": "nws",   # google news
    "start": 0,     # page number, 0 by default
    # "filter": 0   # shows more than 10 pages. By default up to ~10-15 if filter = 1.
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}

page_num = 0
while True:
    page_num += 1
    print(f"{page_num} page:")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    for result in soup.select(".WlydOe"):
        source = result.select_one(".NUnG9d").text
        title = result.select_one(".mCBkyc").text
        link = result.get("href")
        try:
            snippet = result.select_one(".GI74Re").text
        except AttributeError:
            snippet = None
        date = result.select_one(".ZE0LJd").text
        print(source, title, link, snippet, date, sep='\n', end='\n\n')

    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break
Output:
1 page:
BuzzFeed News
Elon Musk’s Viral Shirtless Photos Have Sparked A Conversation Around
Body-Shaming After Some People Argued That He “Deserves” To See The Memes
Mocking His Physique
https://www.buzzfeednews.com/article/leylamohammed/elon-musk-shirtless-yacht-photos-memes-body-shaming
None
18 hours ago
People
Elon Musk Soaks Up Sun While Spending Time with Pals Aboard Luxury Yacht in
Greece
https://people.com/human-interest/elon-musk-spends-time-with-friends-aboard-luxury-yacht-in-greece/
None
2 days ago
New York Post
Elon Musk jokes shirtless pictures in Mykonos are 'good motivation' to hit
gym
https://nypost.com/2022/07/21/elon-musk-jokes-shirtless-pics-in-mykonos-are-good-motivation/
None
14 hours ago
... other results from the 1st and subsequent pages.
10 page:
Vanity Fair
A Reminder of Just Some of the Terrible Things Elon Musk Has Said and Done
https://www.vanityfair.com/news/2022/04/elon-musk-twitter-terrible-things-hes-said-and-done
... yesterday's news with “shock and dismay,” a lot of people are not
enthused about the idea of Elon Musk buying the social media network.
Apr 26, 2022
CNBC
Elon Musk is buying Twitter. Now what?
https://www.cnbc.com/2022/04/27/elon-musk-just-bought-twitter-now-what.html
Elon Musk has finally acquired Twitter after a weekslong saga during which
he first became the company's largest shareholder, then offered...
Apr 27, 2022
New York Magazine
11 Weird and Upsetting Facts About Elon Musk
https://nymag.com/intelligencer/2022/04/11-weird-and-upsetting-facts-about-elon-musk.html
3. Elon allegedly said some pretty awful things to his first wife · While
dancing at their wedding reception, Musk told Justine, “I am the alpha...
Apr 30, 2022
... other results from 10th page.
If you need more information about Google News, have a look at the Web Scraping Google News with Python blog post.