
How to scrape the content of Google News articles from the Google News RSS feed?

In the future (maybe still far away, since I'm still a novice) I want to do data analysis based on the content of the news I get from the Google News RSS feed, but for that I need access to that content, and that is my problem.

Using the URL https://news.google.cl/news/rss I have access to data like the title and the URL of each news item, but the URL is in a redirect format that does not let me scrape the article directly (https://news.google.com/__i/rss/rd/articles/CBMilgFod...).

from urllib.request import urlopen
from bs4 import BeautifulSoup

news_url = "https://news.google.cl/news/rss"
client = urlopen(news_url)
xml_page = client.read()
client.close()

soup_page = BeautifulSoup(xml_page, "xml")
news_list = soup_page.findAll("item")

for news in news_list:
    print(news.title.text)
    print("-" * 60)

    response = urlopen(news.link.text)
    html = response.read()
    article_soup = BeautifulSoup(html, "html.parser")  # separate name so BeautifulSoup isn't shadowed
    text = article_soup.get_text(strip=True)
    print(text)

The last print(text) prints some code like:

if(typeof bbclAM === 'undefined' || !bbclAM.isAM()) {
                        googletag.display('div-gpt-ad-1418416256666-0');
                } else {
                        document.getElementById('div-gpt-ad-1418416256666-0').style.display = 'none'
                }
        });(function(s, p, d) {
            var h=d.location.protocol, i=p+"-"+s,
            e=d.getElementById(i), r=d.getElementById(p+"-root"),
            u=h==="https:"?"d1z2jf7jlzjs58.cloudfront.net"
            :"static."+p+".com";
            if (e) return;

I expect to print the title and the content of each news item from the RSS feed.

Clone this project:

git clone git@github.com:philipperemy/google-news-scraper.git gns
cd gns
sudo pip install -r requirements.txt
python main_no_vpn.py

Output will be:

{
    "content": "............",
    "datetime": "...",
    "keyword": "...",
    "link": "...",
    "title": "..."
},
{
    "content": "............",
    "datetime": "...",
    "keyword": "...",
    "link": "...",
    "title": "..."
}

Source: Here

This script can get you something to start with (it prints the title, URL, short description, and content from each site). The content parsing is only basic, since each site has a different format/styling, etc.:

import textwrap
import requests
from bs4 import BeautifulSoup

news_url="https://news.google.cl/news/rss"
rss_text=requests.get(news_url).text
soup_page=BeautifulSoup(rss_text,"xml")

def get_items(soup):
    for news in soup.findAll("item"):
        s = BeautifulSoup(news.description.text, 'lxml')
        a = s.select('a')[-1]
        a.extract()         # extract the last 'See more on Google News...' link

        html = requests.get(news.link.text)
        soup_content = BeautifulSoup(html.text,"lxml")

        # perform basic sanitization:
        for t in soup_content.select('script, noscript, style, iframe, nav, footer, header'):
            t.extract()

        yield news.title.text.strip(), html.url, s.text.strip(), str(soup_content.select_one('body').text)

width = 80
for (title, url, shorttxt, content) in get_items(soup_page):
    title = '\n'.join(textwrap.wrap(title, width))
    url = '\n'.join(textwrap.wrap(url, width))
    shorttxt = '\n'.join(textwrap.wrap(shorttxt, width))
    content = '\n'.join(textwrap.wrap(textwrap.shorten(content, 1024), width))

    print(title)
    print(url)
    print('-' * width)
    print(shorttxt)
    print()
    print(content)
    print()

Prints:

WWF califica como inaceptable y condenable adulteración de información sobre
salmones de Nova Austral - El Mostrador
https://m.elmostrador.cl/dia/2019/06/30/wwf-califica-como-inaceptable-y-
condenable-adulteracion-de-informacion-sobre-salmones-de-nova-austral/
--------------------------------------------------------------------------------
El MostradorLa organización pide investigar los centros de cultivo de la
salmonera de capitales noruegos y abrirá un proceso formal de quejas. La empresa
ubicada en la ...

01:41:28 WWF califica como inaceptable y condenable adulteración de información
sobre salmones de Nova Austral - El Mostrador País PAÍS WWF califica como
inaceptable y condenable adulteración de información sobre salmones de Nova
Austral por El Mostrador 30 junio, 2019 La organización pide investigar los
centros de cultivo de la salmonera de capitales noruegos y abrirá un proceso
formal de quejas. La empresa ubicada en la Patagonia chilena es acusada de
falsear información oficial ante Sernapesca. 01:41:28 Compartir esta Noticia
Enviar por mail Rectificar Tras una investigación periodística de varios meses,
El Mostrador accedió a abundante información reservada, que incluye correos
electrónicos de la gerencia de producción de la compañía salmonera Nova Austral
–de capitales noruegos– a sus jefes de área, donde se instruye manipular las
estadísticas de mortalidad de los salmones para ocultar las verdaderas cifras a
Sernapesca –la entidad fiscalizadora–, a fin de evitar multas y ver disminuir
las [...]

...and so on.
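If grabbing the whole body text is too noisy for analysis, a dedicated article-extraction library can do better than the basic parsing above. Below is a minimal sketch using the newspaper3k library (my own suggestion, not part of the original script); the URL is just a placeholder for a resolved article link.

# pip install newspaper3k
from newspaper import Article

article_url = "https://www.example.com/some-news-article"  # placeholder: use a real publisher URL
article = Article(article_url, language="es")  # language of the Chilean feed; adjust as needed

article.download()   # fetch the page
article.parse()      # extract the main article text, stripping most boilerplate

print(article.title)
print(article.text[:500])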

In order to access data such as the title and other fields, you first need to collect all the news items. Each news item is located in an item tag, and the items are inside the channel tag. So let's use this selector:

soup.channel.find_all('item')

After that, you can extract the necessary data for each news item.

for result in soup.channel.find_all('item'):
    title = result.title.text
    link = result.link.text
    date = result.pubDate.text
    source  = result.source.get("url")
    
    print(title, link, date, source, sep='\n', end='\n\n')

Also, make sure you're passing a User-Agent request header so the visit looks like it comes from a "real" user. The default requests User-Agent is python-requests, and websites understand that such a request is most likely sent by a script. Check what your User-Agent is.
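To quickly see what requests sends by default, you can print it (a small sketch using the requests library's own helper):

import requests

# Prints something like "python-requests/2.28.1" -- the default User-Agent header
print(requests.utils.default_user_agent())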

Code and full example in an online IDE:

from bs4 import BeautifulSoup
import requests

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "hl": "en-US",         # language
    "gl": "US",            # country of the search, US -> USA
    "ceid": "US:en",
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}

html = requests.get("https://news.google.com/rss", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "xml")

for result in soup.channel.find_all('item'):
    title = result.title.text
    link = result.link.text
    date = result.pubDate.text
    source  = result.source.get("url")
    
    print(title, link, date, source, sep='\n', end='\n\n')

Output:

UK and Europe Heat Wave News: Live Updates - The New York Times
https://news.google.com/__i/rss/rd/articles/CBMiRGh0dHBzOi8vd3d3Lm55dGltZXMuY29tL2xpdmUvMjAyMi8wNy8xOS93b3JsZC91ay1ldXJvcGUtaGVhdC13ZWF0aGVy0gEA?oc=5
Tue, 19 Jul 2022 11:56:58 GMT
https://www.nytimes.com

... other results
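Note that the link values are still Google's redirect URLs (news.google.com/__i/rss/rd/...), not the publisher URLs. Below is a minimal sketch for resolving them, assuming Google either answers with an HTTP redirect or returns a small HTML page that links to the article; verify against what Google actually serves, since this behavior changes over time.

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}

def resolve_google_news_link(rss_link):
    # Case 1: Google answers with an HTTP redirect -> requests follows it automatically
    response = requests.get(rss_link, headers=headers, timeout=30)
    if "news.google.com" not in response.url:
        return response.url

    # Case 2 (assumption): Google returns a small HTML page that links to the article
    page = BeautifulSoup(response.text, "lxml")
    anchor = page.select_one('a[href^="http"]')
    return anchor["href"] if anchor else rss_link

Once you have the publisher URL, you can fetch it and extract the article text as in the other answers.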

Another way to achieve the same thing is to scrape Google News from the HTML instead.

I want to demonstrate how to scrape Google News using pagination. One way is to use the start URL parameter, which is 0 by default: 0 means the first page, 10 the second, and so on.

Also, by default the search returns only about 10-15 pages. To increase the number of returned pages, set the filter parameter to 0 and pass it in the URL, which will return more than 10 pages. Basically, this parameter controls the filters for Similar Results and Omitted Results.

While the next-page button exists, you increment the ["start"] parameter value by 10 to access the next page; otherwise, you break out of the while loop.

And here is the code:

from bs4 import BeautifulSoup
import requests, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "Elon Musk",
    "hl": "en-US",          # language
    "gl": "US",             # country of the search, US -> USA
    "tbm": "nws",           # google news
    "start": 0,             # page offset: 0 is the first page, 10 the second, and so on
    # "filter": 0           # set to 0 to get more than the default ~10-15 pages
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}

page_num = 0
    
while True:
    page_num += 1
    print(f"{page_num} page:")
    
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")
    
    for result in soup.select(".WlydOe"):
        source  = result.select_one(".NUnG9d").text
        title = result.select_one(".mCBkyc").text
        link = result.get("href")
        try:
            snippet = result.select_one(".GI74Re").text
        except AttributeError:
            snippet = None
        date = result.select_one(".ZE0LJd").text
        
        print(source, title, link, snippet, date, sep='\n', end='\n\n')

    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break

Output:

1 page:
BuzzFeed News
Elon Musk’s Viral Shirtless Photos Have Sparked A Conversation Around 
Body-Shaming After Some People Argued That He “Deserves” To See The Memes 
Mocking His Physique
https://www.buzzfeednews.com/article/leylamohammed/elon-musk-shirtless-yacht-photos-memes-body-shaming
None
18 hours ago

People
Elon Musk Soaks Up Sun While Spending Time with Pals Aboard Luxury Yacht in 
Greece
https://people.com/human-interest/elon-musk-spends-time-with-friends-aboard-luxury-yacht-in-greece/
None
2 days ago

New York Post
Elon Musk jokes shirtless pictures in Mykonos are 'good motivation' to hit 
gym
https://nypost.com/2022/07/21/elon-musk-jokes-shirtless-pics-in-mykonos-are-good-motivation/
None
14 hours ago

... other results from the 1st and subsequent pages.

10 page:
Vanity Fair
A Reminder of Just Some of the Terrible Things Elon Musk Has Said and Done
https://www.vanityfair.com/news/2022/04/elon-musk-twitter-terrible-things-hes-said-and-done
... yesterday's news with “shock and dismay,” a lot of people are not 
enthused about the idea of Elon Musk buying the social media network.
Apr 26, 2022

CNBC
Elon Musk is buying Twitter. Now what?
https://www.cnbc.com/2022/04/27/elon-musk-just-bought-twitter-now-what.html
Elon Musk has finally acquired Twitter after a weekslong saga during which 
he first became the company's largest shareholder, then offered...
Apr 27, 2022

New York Magazine
11 Weird and Upsetting Facts About Elon Musk
https://nymag.com/intelligencer/2022/04/11-weird-and-upsetting-facts-about-elon-musk.html
3. Elon allegedly said some pretty awful things to his first wife · While 
dancing at their wedding reception, Musk told Justine, “I am the alpha...
Apr 30, 2022

... other results from 10th page.
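Since the original goal is data analysis, it may be more useful to collect the fields into a list of dicts inside the loop above, instead of printing them, and save everything to a file. A minimal sketch (the file name is arbitrary):

import json

news_results = []

# inside the for-loop above, instead of print(...), collect the fields:
# news_results.append({"source": source, "title": title, "link": link,
#                      "snippet": snippet, "date": date})

with open("google_news_results.json", "w", encoding="utf-8") as f:
    json.dump(news_results, f, ensure_ascii=False, indent=2)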

If you need more information about scraping Google News, have a look at the Web Scraping Google News with Python blog post.
