Tag of Google news title for beautiful soup
I am trying to pull search results from Google News (e.g. for "vaccine") and run some sentiment analysis on the collected headlines.
So far I can't seem to find the right tag to collect the headlines.
Here is my code:
from textblob import TextBlob
import requests
from bs4 import BeautifulSoup

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(self.term)

    def run(self):
        response = requests.get(self.url)
        print(response.text)
        soup = BeautifulSoup(response.text, 'html.parser')
        headline_results = soup.find_all('div', class_="phYMDf nDgy9d")
        for h in headline_results:
            blob = TextBlob(h.get_text())
            self.sentiment += blob.sentiment.polarity / len(headline_results)
            self.subjectivity += blob.sentiment.subjectivity / len(headline_results)

a = Analysis('Vaccine')
a.run()
print(a.term, 'Subjectivity: ', a.subjectivity, 'Sentiment: ', a.sentiment)
The result for sentiment is always 0, and the result for subjectivity is always 0 as well. I suspect the problem is with class_="phYMDf nDgy9d".
If you browse that link in a browser you see the fully rendered page, but requests.get does not execute JavaScript or load any data beyond the page you requested. Fortunately there is some data in the raw response, and you can scrape it. I suggest running the HTML through a beautifier service such as codebeautify to get a better sense of the page structure.
Also, whenever you see class names like phYMDf nDgy9d, make sure to avoid searching by them. They are minified class names, so any time Google changes part of its CSS, the class you are looking for gets a new name.
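To illustrate the difference on a toy snippet (the class names and URLs below are invented; only the structure mimics Google's result markup), matching on a stable attribute such as the /url href prefix survives a CSS rename, while a minified class does not:

```python
import re

from bs4 import BeautifulSoup

# Toy markup: the class names are made-up stand-ins for minified ones.
html = """
<div id="main">
  <div class="xYz12"><a href="/url?q=https://example.com/story1"><div>First headline</div></a></div>
  <div class="aBc34"><a href="/url?q=https://example.com/story2"><div>Second headline</div></a></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Fragile: a minified class matches only until the next CSS redeploy,
# and here it already misses the second result.
fragile = soup.find_all("div", class_="xYz12")

# Sturdier: result links always start with "/url", which is part of the
# page's behaviour rather than its styling.
robust = [a.get_text(strip=True) for a in soup.find_all("a", href=re.compile(r"^/url"))]
print(robust)  # ['First headline', 'Second headline']
```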
What I did is probably a bit of overkill, but I managed to dig down to a specific part, and now your code works.
When you look at the prettified version of the requested HTML file, the necessary content is in a div with an id of main. Its children start with a div element titled Google Search, continue with a style element, and after an empty div element come the post div elements. The last two elements in that child list are the footer and script elements. We can cut those off with [3:-2], and under that tree we have (almost) pure data. If you check the rest of the code after the posts variable, I think you can make sense of it.
Here is the code:
from textblob import TextBlob
import requests, re
from bs4 import BeautifulSoup
from pprint import pprint

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(self.term)

    def run(self):
        response = requests.get(self.url)
        #print(response.text)
        soup = BeautifulSoup(response.text, 'html.parser')
        mainDiv = soup.find("div", {"id": "main"})
        posts = [i for i in mainDiv.children][3:-2]
        news = []
        for post in posts:
            reg = re.compile(r"^/url.*")
            cursor = post.findAll("a", {"href": reg})
            postData = {}
            postData["headline"] = cursor[0].find("div").get_text()
            postData["source"] = cursor[0].findAll("div")[1].get_text()
            postData["timeAgo"] = cursor[1].next_sibling.find("span").get_text()
            postData["description"] = cursor[1].next_sibling.find("span").parent.get_text().split("· ")[1]
            news.append(postData)
        pprint(news)
        for h in news:
            blob = TextBlob(h["headline"] + " " + h["description"])
            self.sentiment += blob.sentiment.polarity / len(news)
            self.subjectivity += blob.sentiment.subjectivity / len(news)

a = Analysis('Vaccine')
a.run()
print(a.term, 'Subjectivity: ', a.subjectivity, 'Sentiment: ', a.sentiment)
Some outputs:
[{'description': 'It comes after US health officials said last week they had '
'started a trial to evaluate a possible vaccine in Seattle. '
'The Chinese effort began on...',
'headline': 'China embarks on clinical trial for virus vaccine',
'source': 'The Star Online',
'timeAgo': '5 saat önce'},
{'description': 'Hanneke Schuitemaker, who is leading a team working on a '
'Covid-19 vaccine, tells of the latest developments and what '
'needs to be done now.',
'headline': 'Vaccine scientist: ‘Everything is so new in dealing with this '
'coronavirus’',
'source': 'The Guardian',
'timeAgo': '20 saat önce'},
.
.
.
Vaccine Subjectivity: 0.34522727272727277 Sentiment: 0.14404040404040402
[{'description': '10 Cool Tech Gadgets To Survive Working From Home. From '
'Wi-Fi and cell phone signal boosters, to noise-cancelling '
'headphones and gadgets...',
'headline': '10 Cool Tech Gadgets To Survive Working From Home',
'source': 'CRN',
'timeAgo': '2 gün önce'},
{'description': 'Over the past few years, smart home products have dominated '
'the gadget space, with goods ranging from innovative updates '
'to the items we...',
'headline': '6 Smart Home Gadgets That Are Actually Worth Owning',
'source': 'Entrepreneur',
'timeAgo': '2 hafta önce'},
.
.
.
Home Gadgets Subjectivity: 0.48007305194805205 Sentiment: 0.3114683441558441
I used the headline and description data to perform the analysis, but you can use them however you like. You have the data now :)
Use this:
headline_results = soup.find_all('div', {'class' : 'BNeawe vvjwJb AP7Wnd'})
You already printed response.text; if you want to find specific data, search for it in the response.text output.
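For instance, a quick membership check on the raw response tells you whether the class you plan to target appears at all. A minimal sketch, with an inline string standing in for response.text:

```python
# Stand-in for response.text; in the real script this is the HTML body that
# requests.get returned, which often differs from what the browser renders
# after JavaScript runs.
html = '<div class="BNeawe vvjwJb AP7Wnd">Some headline</div>'

# If this prints False, find_all on that class can never match and the
# sentiment totals silently stay at 0.
print('BNeawe vvjwJb AP7Wnd' in html)  # True
```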
Try using select() instead. CSS selectors are more flexible. CSS selectors reference.
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired elements in your browser.
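As a small sketch of why select() is handier, run on an inline snippet shaped like a single result (only the class names match Google's markup; the text is invented):

```python
from bs4 import BeautifulSoup

# One result-shaped snippet: container div.dbsr wrapping a link and a title div.
html = '<div class="dbsr"><a href="/url?q=x"><div class="nDgy9d">Some headline</div></a></div>'
soup = BeautifulSoup(html, "html.parser")

# find_all targets one tag at a time...
titles_find = soup.find_all("div", class_="nDgy9d")

# ...while select() accepts any CSS selector, so descendant patterns like
# "title div inside a result container" are a one-liner.
titles_select = soup.select(".dbsr .nDgy9d")

print(titles_select[0].get_text())  # Some headline
```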
If you want to grab all the titles and so on, then you are looking for this container:
soup.select('.dbsr')
Make sure you're passing a user-agent, because Google might eventually block your request and you'll receive different HTML, thus an empty output. Check what your user-agent is.
Pass the user-agent:
headers = {
    "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get("YOUR_URL", headers=headers)
I'm not sure exactly what you're trying to do, but as he mentioned himself, Guven Degirmenci's solution is a bit of overkill, with slicing, regex, and digging around in div#main. It's much simpler than that.
from textblob import TextBlob
import requests
from bs4 import BeautifulSoup

headers = {
    "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = f"https://www.google.com/search?q={self.term}&tbm=nws"

    def run(self):
        response = requests.get(self.url, headers=headers)
        soup = BeautifulSoup(response.text, "html.parser")

        news_data = []
        for result in soup.select('.dbsr'):
            title = result.select_one('.nDgy9d').text
            link = result.a['href']
            source = result.select_one('.WF4CUc').text
            snippet = result.select_one('.Y3v8qd').text
            date_published = result.select_one('.WG9SHc span').text

            news_data.append({
                "title": title,
                "link": link,
                "source": source,
                "snippet": snippet,
                "date_published": date_published
            })

        for h in news_data:
            blob = TextBlob(f"{h['title']} {h['snippet']}")
            self.sentiment += blob.sentiment.polarity / len(news_data)
            self.subjectivity += blob.sentiment.subjectivity / len(news_data)

a = Analysis("Lasagna")
a.run()
print(a.term, "Subjectivity: ", a.subjectivity, "Sentiment: ", a.sentiment)

# Vaccine Subjectivity: 0.3255952380952381 Sentiment: 0.05113636363636363
# Lasagna Subjectivity: 0.36556818181818185 Sentiment: 0.25386093073593075
Alternatively, you can achieve the same thing with the Google News Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to maintain the parser, figure out how to parse certain elements, work out why something doesn't function as it should, or understand how to bypass blocks from Google. All that needs to be done is to iterate over structured JSON and quickly get what you want.
Code to integrate with your example:
from textblob import TextBlob
import os
from serpapi import GoogleSearch

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0

    def run(self):
        params = {
            "engine": "google",
            "tbm": "nws",
            "q": self.term,
            "api_key": os.getenv("API_KEY"),
        }

        search = GoogleSearch(params)
        results = search.get_dict()

        news_data = []
        for result in results['news_results']:
            title = result['title']
            link = result['link']
            snippet = result['snippet']
            source = result['source']
            date_published = result['date']

            news_data.append({
                "title": title,
                "link": link,
                "source": source,
                "snippet": snippet,
                "date_published": date_published
            })

        for h in news_data:
            blob = TextBlob(f"{h['title']} {h['snippet']}")
            self.sentiment += blob.sentiment.polarity / len(news_data)
            self.subjectivity += blob.sentiment.subjectivity / len(news_data)

a = Analysis("Vaccine")
a.run()
print(a.term, "Subjectivity: ", a.subjectivity, "Sentiment: ", a.sentiment)
P.S. I wrote a slightly more detailed blog post about how to scrape Google News.
Disclaimer, I work for SerpApi.