
Scrape news from google news

This may look like a duplicate of other questions about scraping content from news.google.com, but it is not: those questions only ask for the entire HTML code, not the URL of each article.

I am trying to create two functions that scrape news from news.google.com, returning either the top stories or news based on what the user inputs, e.g.:

>>> news top
> <5 URLs of top stories on news.google.com>

or

>>> news london
> <5 URLs of London-related news from news.google.com>

Here is my work-in-progress code (I am not very familiar with scraping or making requests, so I do not know how to take it further):

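import requests
from lxml import html

def get_news(user_define_input):
    # user_define_input[1] is the search term, e.g. "london"
    try:
        response = requests.get(
            "https://www.google.com/search",
            params={"hl": "en", "gl": "us", "tbm": "nws", "authuser": 0,
                    "q": user_define_input[1]},
        )
    except requests.RequestException:
        print("Error while retrieving data!")
        return
    tree = html.fromstring(response.text)
    # /text() returns only the text nodes, not the article URLs
    news = tree.xpath("//div[@class='l _HId']/text()")
    print(news)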

I realize that /text() does not return the URL, but I do not know how to get it, hence the question.

You can add this to make it look better if you want:

news = "<anything>".join(news)

To clear things up, user_define_input[0] is "news" from what the user entered, and user_define_input[1] is the search term, e.g. "london", so all results should be related to London. If you are kind enough to also write my other function, which gets all top stories from news.google.com, thank you very much! :) (The code should be similar, so I am not posting anything related to it here.)

Code after help (still not working):

import requests
from lxml import html

def get_news(user_define_input):
    try:
        response = requests.get(
            "https://www.google.com/search",
            params={"hl": "en", "gl": "us", "tbm": "nws", "authuser": 0,
                    "q": user_define_input[1]},
        )
    except requests.RequestException:
        print("Error while retrieving data!")
        return
    tree = html.fromstring(response.text)
    url_to_news = tree.xpath(".//div[@class='esc-lead-article-title-wrapper']/h2[@class='esc-lead-article-title']/a/@href")
    for url in url_to_news:
        print(url)
    summary_of_the_new = tree.xpath(".//div[@class='esc-lead-snippet-wrapper']/text()")
    title_of_the_new = tree.xpath(".//span[@class='titletext']/text()")
    print(summary_of_the_new)
    print(title_of_the_new)

As I understand it, you want to get the URL of every news article that appears when a user enters a query, right?

To get that you will need this xpath expression:

url_to_news = tree.xpath(".//div[@class='esc-lead-article-title-wrapper']/h2[@class='esc-lead-article-title']/a/@href")

It will return a list with the URLs of the news articles.

Since it is a list, you only need a for-loop to iterate over the URLs:

for url in url_to_news:
    print(url)
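To see how the path in that expression lines up with the markup, here is a minimal offline sketch. It is only an illustration under a few assumptions: the sample HTML is made up (it just reuses the class names from the expression above), and to stay dependency-free it uses the standard library's xml.etree.ElementTree instead of lxml. ElementTree supports only a subset of XPath and has no /@href step, so the sketch selects the a elements and reads the attribute with .get("href"); with lxml the /@href form above does this in one step.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; only the class names come from the answer.
SAMPLE = """
<html><body>
  <div class="esc-lead-article-title-wrapper">
    <h2 class="esc-lead-article-title">
      <a href="http://example.com/london-1">First story</a>
    </h2>
  </div>
  <div class="esc-lead-article-title-wrapper">
    <h2 class="esc-lead-article-title">
      <a href="http://example.com/london-2">Second story</a>
    </h2>
  </div>
</body></html>
"""

def extract_urls(markup):
    """Return the href of every title link in the given markup."""
    root = ET.fromstring(markup.strip())
    # Same element path as the XPath above, without the final /@href step.
    links = root.findall(
        ".//div[@class='esc-lead-article-title-wrapper']"
        "/h2[@class='esc-lead-article-title']/a"
    )
    return [a.get("href") for a in links]

url_to_news = extract_urls(SAMPLE)
for url in url_to_news:
    print(url)
```

If the list comes back empty on a real page, the class names in the served HTML probably differ from the ones in the expression.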

Add-on:

To get the summary of the news you will need this:

summary_of_the_new = tree.xpath(".//div[@class='esc-lead-snippet-wrapper']/text()")

And finally, the titles of the news will be:

title_of_the_new = tree.xpath(".//span[@class='titletext']/text()")

After that you can map all that information together. Please comment on this answer if you need further help. I answered the question according to what I understood.
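One possible way to map the three lists together is sketched below. It assumes the lists line up by position (first URL with first title and first summary), which is not guaranteed for a real page; the combine helper and the sample values are hypothetical and only there for illustration.

```python
def combine(urls, titles, summaries):
    """Pair up three parallel lists into one dict per news item.

    zip() stops at the shortest list, so extra entries in a longer list are dropped.
    """
    return [
        {"title": title.strip(), "url": url, "summary": summary.strip()}
        for url, title, summary in zip(urls, titles, summaries)
    ]

# Made-up sample values standing in for the results of the three xpath() calls.
url_to_news = ["http://example.com/a", "http://example.com/b"]
title_of_the_new = ["First story ", "Second story"]
summary_of_the_new = ["Summary of the first story.", " Summary of the second story."]

for item in combine(url_to_news, title_of_the_new, summary_of_the_new):
    print(item["title"], "-", item["url"])
```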

Check my implementation at http://mpand.github.io/gnp/

It returns the stories and URLs as a JSON object.
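The output format of the linked implementation is not shown here, so the following is only a rough illustration of what returning stories and URLs as JSON could look like, using the standard json module. The stories_to_json helper, the field names, and the sample data are all assumptions, not the format that project uses.

```python
import json

def stories_to_json(stories):
    """Serialize a list of (title, url) pairs into a JSON string."""
    payload = {"stories": [{"title": title, "url": url} for title, url in stories]}
    return json.dumps(payload, indent=2)

# Made-up sample data.
sample = [
    ("First story", "http://example.com/a"),
    ("Second story", "http://example.com/b"),
]
print(stories_to_json(sample))
```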
