
Tag of Google news title for beautiful soup

I am trying to pull search results from Google News (for example, "vaccine") and run some sentiment analysis on the collected headlines.

So far, I can't seem to find the right tag to collect the headlines.

Here is my code:

from textblob import TextBlob
import requests
from bs4 import BeautifulSoup

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(self.term)

    def run (self):
        response = requests.get(self.url)
        print(response.text)
        soup = BeautifulSoup(response.text, 'html.parser')
        headline_results = soup.find_all('div', class_="phYMDf nDgy9d")
        for h in headline_results:
            blob = TextBlob(h.get_text())
            self.sentiment += blob.sentiment.polarity / len(headline_results)
            self.subjectivity += blob.sentiment.subjectivity / len(headline_results)
a = Analysis('Vaccine')
a.run()
print(a.term, 'Subjectivity: ', a.subjectivity, 'Sentiment: ' , a.sentiment)

The result for sentiment is always 0, and subjectivity is always 0 as well. I suspect the problem is with class_="phYMDf nDgy9d".

If you open that link in a browser you see the finished state of the page, but requests.get does not execute or load any data beyond the page you requested. Luckily there is some data in the response, and you can scrape it. I suggest you run the HTML through a beautifier service such as codebeautify to get a better sense of what the page structure looks like.

Also, if you see classes like phYMDf nDgy9d, be sure to avoid searching with them. They are minified class names, so the moment Google changes any part of its CSS, the class you are looking for gets a new name.
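To make the point above concrete, here is a small offline sketch. The HTML below is a made-up miniature of a results page (not Google's real markup): matching a minified class breaks as soon as the name changes, while anchoring on a more stable attribute, such as the `/url...` href pattern of result links, keeps working.

```python
import re
from bs4 import BeautifulSoup

# A made-up miniature of a results page; Google's real markup differs.
html = """
<div id="main">
  <a href="/url?q=https://example.com/story1"><div class="phYMDf nDgy9d">Headline one</div></a>
  <a href="/url?q=https://example.com/story2"><div class="xQzR7 aB3cD">Headline two</div></a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Fragile: only finds posts that still carry the old minified class name.
by_class = [d.get_text() for d in soup.find_all("div", class_="phYMDf nDgy9d")]

# More robust: match the stable "/url..." href pattern of result links.
by_href = [a.div.get_text() for a in soup.find_all("a", href=re.compile(r"^/url"))]

print(by_class)  # ['Headline one']  -- misses the second post
print(by_href)   # ['Headline one', 'Headline two']
```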

What I did is probably a bit of an overkill, but I managed to dig down to the specific part, and your code works now.


When you look at a prettified version of the requested HTML, the necessary content is inside a div whose id is main. Its children start with a div element for the Google search bar, continue with a style element, and, after an empty div element, come the post div elements. The last two elements in that child list are footer and script elements. We can cut those off with [3:-2], and under that tree we have (almost) pure data. If you check the rest of the code after the posts variable, I think you can follow it.

Here is the code:

from textblob import TextBlob
import requests, re
from bs4 import BeautifulSoup
from pprint import pprint

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(self.term)

    def run (self):
        response = requests.get(self.url)
        #print(response.text)
        soup = BeautifulSoup(response.text, 'html.parser')
        mainDiv = soup.find("div", {"id": "main"})
        posts = [i for i in mainDiv.children][3:-2]
        news = []
        for post in posts:
            reg = re.compile(r"^/url.*")
            cursor = post.findAll("a", {"href": reg})
            postData = {}
            postData["headline"] = cursor[0].find("div").get_text()
            postData["source"] = cursor[0].findAll("div")[1].get_text()
            postData["timeAgo"] = cursor[1].next_sibling.find("span").get_text()
            postData["description"] = cursor[1].next_sibling.find("span").parent.get_text().split("· ")[1]
            news.append(postData)
        pprint(news)
        for h in news:
            blob = TextBlob(h["headline"] + " "+ h["description"])
            self.sentiment += blob.sentiment.polarity / len(news)
            self.subjectivity += blob.sentiment.subjectivity / len(news)
a = Analysis('Vaccine')
a.run()

print(a.term, 'Subjectivity: ', a.subjectivity, 'Sentiment: ' , a.sentiment)

Some of the output:

[{'description': 'It comes after US health officials said last week they had '
                 'started a trial to evaluate a possible vaccine in Seattle. '
                 'The Chinese effort began on...',
  'headline': 'China embarks on clinical trial for virus vaccine',
  'source': 'The Star Online',
  'timeAgo': '5 saat önce'},
 {'description': 'Hanneke Schuitemaker, who is leading a team working on a '
                 'Covid-19 vaccine, tells of the latest developments and what '
                 'needs to be done now.',
  'headline': 'Vaccine scientist: ‘Everything is so new in dealing with this '
              'coronavirus’',
  'source': 'The Guardian',
  'timeAgo': '20 saat önce'},
 .
 .
 .
Vaccine Subjectivity:  0.34522727272727277 Sentiment:  0.14404040404040402
[{'description': '10 Cool Tech Gadgets To Survive Working From Home. From '
                 'Wi-Fi and cell phone signal boosters, to noise-cancelling '
                 'headphones and gadgets...',
  'headline': '10 Cool Tech Gadgets To Survive Working From Home',
  'source': 'CRN',
  'timeAgo': '2 gün önce'},
 {'description': 'Over the past few years, smart home products have dominated '
                 'the gadget space, with goods ranging from innovative updates '
                 'to the items we...',
  'headline': '6 Smart Home Gadgets That Are Actually Worth Owning',
  'source': 'Entrepreneur',
  'timeAgo': '2 hafta önce'},
 .
 .
 .
Home Gadgets Subjectivity:  0.48007305194805205 Sentiment:  0.3114683441558441

I used the headline and description data for the analysis, but you can use whichever fields you like. You have the data now :)

Use this:

headline_results = soup.find_all('div', {'class' : 'BNeawe vvjwJb AP7Wnd'})

You have already printed response.text; if you want to find specific data, search for it within the response.text output.
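A small sketch of that debugging step: dump the raw response.text to a file so you can open and search it in an editor, or check directly whether the class you are targeting exists in the HTML at all. The response text below is faked with a plain string so the snippet runs offline; in your code it would come from requests.get(...).

```python
from pathlib import Path

# Stand-in for response.text; in your code this comes from requests.get(...).
response_text = '<div class="BNeawe vvjwJb AP7Wnd">Some headline</div>'

# Write the raw HTML to a file so you can search it comfortably in an editor.
Path("response.html").write_text(response_text, encoding="utf-8")

# Or search it directly: does the class you are targeting even appear?
print("BNeawe vvjwJb AP7Wnd" in response_text)  # True
```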

Try using select() instead. CSS selectors are more flexible. See the CSS selectors reference.

Check out the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser.
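As an offline sketch of what select() buys you: a CSS selector can express a parent-child relationship and an attribute match in one expression, where find_all would need several steps. The markup below is illustrative only; the class names mimic the answer's examples, not Google's actual page.

```python
from bs4 import BeautifulSoup

# Made-up markup; the classes are illustrative, not Google's actual ones.
html = """
<div class="dbsr">
  <a href="https://example.com/a"><div class="nDgy9d">First headline</div></a>
</div>
<div class="dbsr">
  <a href="https://example.com/b"><div class="nDgy9d">Second headline</div></a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# One selector walks the container and picks the nested title div.
titles = [t.get_text() for t in soup.select(".dbsr .nDgy9d")]

# Attribute selectors work too: only anchors that actually have an href.
links = [a["href"] for a in soup.select(".dbsr a[href]")]

print(titles)  # ['First headline', 'Second headline']
print(links)   # ['https://example.com/a', 'https://example.com/b']
```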

If you want to grab all the titles and so on, then you are looking for this container:

soup.select('.dbsr')

Make sure you are passing a user-agent, because Google may eventually block your requests and you will receive different HTML, leaving the output empty. Check what your user-agent is.

Pass the user-agent:

headers = {
    "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get("YOUR_URL", headers=headers)

I'm not sure exactly what you're trying to do, but as mentioned, Guven Degirmenci's solution is a bit of an overkill, with slicing, regex, and digging around in div#main. It's much simpler than that.


Code and example in the online IDE:

from textblob import TextBlob
import requests
from bs4 import BeautifulSoup

headers = {
   "User-agent":
   "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = f"https://www.google.com/search?q={self.term}&tbm=nws"
    
 
    def run (self):
        response = requests.get(self.url, headers=headers)
        soup = BeautifulSoup(response.text, "html.parser")

        news_data = []

        for result in soup.select('.dbsr'):
          title = result.select_one('.nDgy9d').text
          link = result.a['href']
          source = result.select_one('.WF4CUc').text
          snippet = result.select_one('.Y3v8qd').text
          date_published = result.select_one('.WG9SHc span').text

          news_data.append({
            "title": title,
            "link": link,
            "source": source, 
            "snippet": snippet,
            "date_published": date_published
          })

        for h in news_data:
            blob = TextBlob(f"{h['title']} {h['snippet']}")
            self.sentiment += blob.sentiment.polarity / len(news_data)
            self.subjectivity += blob.sentiment.subjectivity / len(news_data)


a = Analysis("Lasagna")
a.run()

print(a.term, "Subjectivity: ", a.subjectivity, "Sentiment: " , a.sentiment)

# Vaccine Subjectivity:  0.3255952380952381 Sentiment:  0.05113636363636363
# Lasagna Subjectivity:  0.36556818181818185 Sentiment:  0.25386093073593075

Alternatively, you can achieve the same thing by using the Google News Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to maintain the parser, figure out how to parse certain elements, debug why something isn't working, or learn how to bypass Google's blocks. All you have to do is iterate over structured JSON and quickly get what you want.

Code integrated with your example:


from textblob import TextBlob
import os
from serpapi import GoogleSearch


class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0

    def run(self):
        params = {
          "engine": "google",
          "tbm": "nws",
          "q": self.term,
          "api_key": os.getenv("API_KEY"),
        }

        search = GoogleSearch(params)
        results = search.get_dict()

        news_data = []

        for result in results['news_results']:
          title = result['title']
          link = result['link']
          snippet = result['snippet']
          source = result['source']
          date_published = result['date']

          news_data.append({
            "title": title,
            "link": link,
            "source": source, 
            "snippet": snippet,
            "date_published": date_published
          })

        for h in news_data:
            blob = TextBlob(f"{h['title']} {h['snippet']}")
            self.sentiment += blob.sentiment.polarity / len(news_data)
            self.subjectivity += blob.sentiment.subjectivity / len(news_data)


a = Analysis("Vaccine")
a.run()

print(a.term, "Subjectivity: ", a.subjectivity, "Sentiment: " , a.sentiment)


# Vaccine Subjectivity:  0.30957251082251086 Sentiment:  0.06277056277056277

P.S. - I wrote a more detailed blog post about how to scrape Google News.

Disclaimer: I work for SerpApi.
