Scraping news from Google News

Question

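This may look like a duplicate of other questions about scraping content from news.google.com, but it is not: those questions only ask for the full HTML code, not the URL links of the articles.

I am trying to create two functions that either scrape the top news from news.google.com or fetch news based on what the user enters, i.e.: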

>>> news top
> <5 url of top stories in news.google.com>

or

>>> news london
> <5 london related news url from news.google.com>

This is the code I am working on (I am not very familiar with scraping/requests, so I don't know how to proceed):

import requests
from lxml import html

def get_news(user_define_input):
    try:
        response = requests.get("https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=test&oq="+format(user_define_input[1]))
    except:
        print ("Error while retrieving data!")
        return
    tree = html.fromstring(response.text)
    news = tree.xpath("//div[@class='l _HId']/text()")
    print (news)

I do realise that /text() does not get the URL, but I don't know what would, hence the question.

If you want, you can add this to make the output look better:

news = "<anything>".join(news)

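To clarify, user_define_input[0] will be the "news" that the user typed, and user_define_input[1] will be the search term, e.g. "london", so all results should be related to London. If you would be kind enough to take the time to also do my other function, which gets all the top news from news.google.com, I would greatly appreciate it! :) (It should be similar code, so I won't post anything about it here.)

Code after help (still not working):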
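Note that the request URL in the code above hard-codes q=test and appends the search term to oq, so the search is always for "test". A minimal sketch of building the URL with the term in q instead is shown here; build_news_url is a hypothetical helper name, and the parameter set is just the subset of the question's own URL that seems relevant.

```python
from urllib.parse import urlencode

# Same endpoint as in the question's code.
SEARCH_URL = "https://www.google.com/search"


def build_news_url(user_define_input):
    """Build a news search URL from input such as ["news", "london"].

    Hypothetical helper: user_define_input[1] is the search term and is
    placed in `q`; urlencode also escapes spaces and special characters.
    """
    params = {"hl": "en", "gl": "us", "tbm": "nws", "q": user_define_input[1]}
    return SEARCH_URL + "?" + urlencode(params)


print(build_news_url(["news", "london"]))
```

The resulting string can be passed to requests.get in place of the concatenated URL.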

def get_news(user_define_input):
    try:
        response = requests.get("https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=test&oq="+format(user_define_input[1]))
    except:
        print ("Error while retrieving data!")
        return
    tree = html.fromstring(response.text)
    url_to_news = tree.xpath(".//div[@class='esc-lead-article-title-wrapper']/h2[@class='esc-lead-article-title']/a/@href")
    for url in url_to_news:
        print(url)
    summary_of_the_new = tree.xpath(".//div[@class='esc-lead-snippet-wrapper']/text()")
    title_of_the_new = tree.xpath(".//span[@class='titletext']/text()")
    print (summary_of_the_new)
    print (title_of_the_new)

Answer 1

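As I understand it, you want to get the URLs of all the news items that come up for the query the user enters, right?

For that, you will need the following XPath expression: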

url_to_news = tree.xpath(".//div[@class='esc-lead-article-title-wrapper']/h2[@class='esc-lead-article-title']/a/@href")

It will return a list containing the URLs of the news items.

Since it is a list, to iterate over the URLs you just need a for loop:

for url in url_to_news:
    print(url)
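To see what the expression returns without hitting the network, here is a minimal sketch run against made-up markup. The sample HTML and the example.com links are assumptions for illustration; only the class names come from the XPath above, and the live page may be structured differently.

```python
from lxml import html

# Hypothetical markup that mimics the class names used in the XPath above.
SAMPLE = """
<html><body>
<div class="esc-lead-article-title-wrapper">
  <h2 class="esc-lead-article-title">
    <a href="http://example.com/story-1"><span class="titletext">Story one</span></a>
  </h2>
</div>
<div class="esc-lead-article-title-wrapper">
  <h2 class="esc-lead-article-title">
    <a href="http://example.com/story-2"><span class="titletext">Story two</span></a>
  </h2>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# /@href selects the attribute value itself, where /text() would select link text.
url_to_news = tree.xpath(
    ".//div[@class='esc-lead-article-title-wrapper']"
    "/h2[@class='esc-lead-article-title']/a/@href"
)

# Keep at most five results, as the question asks for.
first_five = url_to_news[:5]
for url in first_five:
    print(url)
```

If the list comes back empty on the real page, the class names in the expression probably no longer match the markup that was actually served.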

Added:

To get the news summaries, you will need the following:

summary_of_the_new = tree.xpath(".//div[@class='esc-lead-snippet-wrapper']/text()")

Finally, the news titles are:

title_of_the_new = tree.xpath(".//span[@class='titletext']/text()")

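After that, you can map all of this information together. If you need further help, please comment on this answer. I have answered the question according to my understanding of it.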
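One way to do that mapping is sketched here, assuming the three lists line up by position, which is not guaranteed if some stories lack a summary. combine and its limit parameter are hypothetical names introduced for illustration.

```python
def combine(titles, urls, summaries, limit=5):
    """Pair up parallel result lists into one dict per story.

    zip() stops at the shortest list, so stories with missing fields
    at the end are dropped rather than mismatched.
    """
    stories = []
    for title, url, summary in zip(titles, urls, summaries):
        stories.append(
            {"title": title.strip(), "url": url, "summary": summary.strip()}
        )
    return stories[:limit]


# Usage with placeholder data standing in for the three xpath() results.
stories = combine(
    ["Story one ", "Story two"],
    ["http://example.com/story-1", "http://example.com/story-2"],
    ["First summary.", " Second summary."],
)
for story in stories:
    print(story["title"], story["url"])
```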

Answer 2

Check my implementation at http://mpand.github.io/gnp/

It returns the stories and URLs as a JSON object.

Answer 1
1 2015-08-03 04:30:05

Answer 2
0 2015-08-03 10:23:56
