Google search scraper, Python

Question

I'm new to Python and am trying to build a Google search scraper to get stock prices. When I run the code below I don't get any results; I just get the page's HTML back.

import urllib.request
from bs4 import BeautifulSoup

import requests

url = 'https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=uwti'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html, "html.parser")

print(soup.prettify())

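Am I missing something really simple? Please give me some hints. I'm trying to extract the current stock value. How do I extract the value shown in the attached image?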

Answer 1

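The price is in the page source, as you can see when you right-click and choose View Source in your browser. You just need to change the URL slightly and pass a user agent so that what requests receives matches what you see there: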

In [2]: from bs4 import BeautifulSoup
   ...: import requests
   ...: 
   ...: url = 'https://www.google.com/search?q=uwti&rct=j'
   ...: response = requests.get(url, headers={
   ...:     "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36"})
   ...: html = response.content
   ...: 
   ...: soup = BeautifulSoup(html, "html.parser")
   ...: print(soup.select_one("span._Rnb.fmob_pr.fac-l").text)
   ...: 
27.51

soup.find("span", class_="_Rnb fmob_pr fac-l").text also works, and is the correct way to look up a tag by CSS class with find or find_all.
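
One caveat with the find form, shown here as a minimal sketch on made-up markup (the two span elements are hypothetical; only the class names come from this answer): a multi-class string passed to class_ is compared against the attribute value exactly as written, while the CSS selector form matches whatever order the classes appear in.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same three classes, written in two different orders.
html = """
<span class="_Rnb fmob_pr fac-l">27.51</span>
<span class="fac-l _Rnb fmob_pr">27.60</span>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: order matters.
exact_prices = [tag.text for tag in soup.find_all("span", class_="_Rnb fmob_pr fac-l")]

# CSS selector: every listed class must be present, in any order.
css_prices = [tag.text for tag in soup.select("span._Rnb.fmob_pr.fac-l")]

print(exact_prices)
print(css_prices)
```

If the page ever reorders its classes, the select / select_one form is the more forgiving of the two.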

When you use https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=uwti, you can see in Chrome that it redirects to https://www.google.com/search?q=uwti&rct=j.
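
Why the original URL comes back without the price can be sketched with the standard library alone. Everything after # is a URL fragment, which an HTTP client such as requests does not send to the server; the browser handles it client-side. The snippet only parses the URL and makes no network request. The rebuilt search_url is an illustration of the /search?q= form shown above, not an official recipe.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

original = ("https://www.google.com/webhp?sourceid=chrome-instant"
            "&ion=1&espv=2&ie=UTF-8#q=uwti")
parts = urlsplit(original)

# Only scheme, host, path and query are sent; the fragment stays in the client.
sent_query = parse_qs(parts.query)
fragment_params = parse_qs(parts.fragment)

print("q" in sent_query)   # is the search term part of what the server sees?
print(fragment_params)     # the term only lives in the fragment

# Rebuild a URL that carries the search term in the query string instead.
search_url = "https://www.google.com/search?" + urlencode({"q": fragment_params["q"][0]})
print(search_url)
```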

Answer 2

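It's fairly easy to do:

1. Add a user-agent to your request so that Google treats it as a visit from a real user. (List of user agents.)
2. Use the SelectorGadget Chrome extension to quickly find CSS selectors.
3. Use the extracted CSS selector with the bs4 .select_one() method to get the data.

Code and a full example in an online IDE: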

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=spgsclp', headers=headers)
soup = BeautifulSoup(html.text, 'lxml')

current_stock_price = soup.select_one('.wT3VGc').text
print(current_stock_price)

>>> 108,52
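
Note that .select_one() returns None when nothing matches, so calling .text on the result raises AttributeError if Google changes its class names. A minimal defensive sketch follows; extract_text is a hypothetical helper, the sample markup is made up, and html.parser is used only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

def extract_text(html, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node is not None else None

# Hypothetical markup that reuses the class name from the code above.
sample = '<div><span class="wT3VGc"> 108,52 </span></div>'

found = extract_text(sample, ".wT3VGc")
missing = extract_text(sample, ".no-such-class")
print(found)
print(missing)
```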

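Alternatively, you can do the same thing with the Google Direct Answer Box API from SerpApi. It's a paid API with a free trial of 5,000 searches.

The biggest difference in this example is that you don't have to figure out why something isn't working, or how to scrape the data in the first place. Getting the data is a much more straightforward process.

Code to integrate: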

from serpapi import GoogleSearch

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google",
  "q": "spgsclp",
}

search = GoogleSearch(params)
results = search.get_dict()

current_stock_price = results['answer_box']['price']
print(current_stock_price)

>>> 108,52
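
The lookup results['answer_box']['price'] raises KeyError when a query returns no answer box. Here is a small sketch of a more tolerant lookup; price_from_results is a hypothetical helper, and the sample dictionaries only mimic the two keys used above rather than the full response format.

```python
def price_from_results(results):
    """Return the answer-box price if present, otherwise None."""
    answer_box = results.get("answer_box") or {}
    return answer_box.get("price")

# Hypothetical response shapes, limited to the keys used in the code above.
with_box = {"answer_box": {"price": "108,52"}}
without_box = {"organic_results": []}

print(price_from_results(with_box))
print(price_from_results(without_box))
```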

Disclaimer: I work for SerpApi.

Answer 3

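Check Beautiful Soup's documentation on how to select elements of the HTML document you just parsed. You can try something like this:

soup.find_all("span", ['_Rnb', 'fmob_pr', 'fac-l'])

This finds span elements that have any of the classes in the list.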
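
To make the matching rule concrete, here is a short sketch on hypothetical markup. It uses the documented class_ keyword form of the list filter and compares it with a CSS selector that requires all three classes at once.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<span class="_Rnb fmob_pr fac-l">27.51</span>
<span class="fac-l">other</span>
<span class="unrelated">skip</span>
"""
soup = BeautifulSoup(html, "html.parser")

# A list of classes matches a tag that has at least one of them.
any_match = [t.text for t in soup.find_all("span", class_=["_Rnb", "fmob_pr", "fac-l"])]

# A chained CSS selector matches only tags that have every class.
all_match = [t.text for t in soup.select("span._Rnb.fmob_pr.fac-l")]

print(any_match)
print(all_match)
```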

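FYI: from what I can see, the initial request does not fetch the stock price. Use your browser's Inspect Element feature to capture the requests being sent; from what I saw, there is a request to the URL https://www.google.gr/async/finance_price_updates. Perhaps that is what fetches the stock price, so see whether you can send a request to it directly instead of fetching the whole HTML.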

Answer 4

Google won't let you scrape it, so you have to use an API or switch to a different site for the stock data.

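import urllib.request
from bs4 import BeautifulSoup

url = 'siteurl'  # placeholder: the page you want to scrape
response = urllib.request.urlopen(url)  # urlopen lives in urllib.request on Python 3

soup = BeautifulSoup(response, "html.parser")

# placeholder: the class of the div that holds the value
print(soup.find_all("div", {"class": "classname"}))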

You can use this code by changing 'siteurl' and 'classname' to the site and the class you want to scrape.
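
Some sites also reject urllib's default Python-urllib user agent, which ties in with the user-agent advice in the other answers. The sketch below attaches a browser-like header using only the standard library. The example.com URL is a placeholder, build_request and fetch are hypothetical helpers, and only fetch touches the network (it is defined here but not called).

```python
import urllib.request

# Placeholder values; replace them before use.
URL = "https://example.com/quote"
USER_AGENT = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36")

def build_request(url, user_agent=USER_AGENT):
    """Create a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch(url):
    """Download the page body as bytes (performs a real network request)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read()

request = build_request(URL)
# urllib stores header names capitalized, hence "User-agent" in the lookup.
print(request.get_header("User-agent"))
```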

4 answers

Answer 1
3 2016-10-15 23:36:40

Answer 2
1 2021-03-25 11:50:17

Answer 3
0 2016-10-15 22:05:18

Answer 4
0 2017-08-06 06:58:35
