
Scraping Google Finance (BeautifulSoup)

I'm trying to scrape Google Finance and get the "Related Stocks" table, which has id "cc-table" and class "gf-table" according to the page inspector in Chrome. (Sample link: https://www.google.com/finance?q=tsla)

But when I run .find("table") or .findAll("table"), this table does not come up. I can see JSON-looking objects with the table's contents in the HTML I get back in Python, but I do not know how to extract them. Any ideas?

The page is rendered with JavaScript. There are several ways to render and scrape it.
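
Before rendering anything, it may be worth checking whether the JSON-looking objects mentioned in the question can be read straight from the raw HTML. The sketch below uses only the standard library. The sample HTML, the `google.finance.data` marker, and the `rows` layout are all made up for illustration, because the real structure of the page is not shown here. It also assumes the embedded data is strict JSON (double-quoted keys), which a real page may not satisfy.

```python
import json

# Hypothetical sample standing in for the downloaded page source.
SAMPLE_HTML = """
<html><body>
<script>
google.finance.data = {"company": {"related": {"rows": [
  {"id": "1", "values": ["TSLA", "Tesla Inc", "328.40"]},
  {"id": "2", "values": ["F", "Ford Motor Company", "11.53"]}
]}}};
</script>
</body></html>
"""


def extract_embedded_json(html, marker):
    """Return the JSON object that follows `marker` in `html`, or None."""
    start = html.find(marker)
    if start == -1:
        return None
    brace = html.find("{", start)
    if brace == -1:
        return None
    # raw_decode parses one JSON value starting at `brace` and stops there,
    # so the JavaScript that follows the object is ignored.
    obj, _end = json.JSONDecoder().raw_decode(html, brace)
    return obj


data = extract_embedded_json(SAMPLE_HTML, "google.finance.data")
for row in data["company"]["related"]["rows"]:
    print(row["values"])
```

If the embedded data is not valid JSON, or is filled in only after scripts run, rendering the page is the more reliable route.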

I can scrape it with Selenium. First install Selenium:

sudo pip3 install selenium

Then download a ChromeDriver build that matches your Chrome version: https://sites.google.com/a/chromium.org/chromedriver/downloads

import bs4 as bs
from selenium import webdriver

# Launch Chrome, load the page so its JavaScript runs, and grab the rendered HTML.
browser = webdriver.Chrome()
url = "https://www.google.com/finance?q=tsla"
browser.get(url)
html_source = browser.page_source
browser.quit()

# Parse the rendered HTML and print the text of the "cc-table" table.
soup = bs.BeautifulSoup(html_source, "lxml")
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())

Alternatively, use PyQt5:

import sys
import bs4 as bs
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication

class Render(QWebPage):  
    def __init__(self, url):  
        self.app = QApplication(sys.argv)  
        QWebPage.__init__(self)  
        self.loadFinished.connect(self._loadFinished)  
        self.mainFrame().load(QUrl(url))  
        self.app.exec_()  

    def _loadFinished(self, result):  
        self.frame = self.mainFrame()  
        self.app.quit()  

url = "https://www.google.com/finance?q=tsla"
r = Render(url)  
result = r.frame.toHtml()
soup = bs.BeautifulSoup(result,'lxml')
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())

Alternatively, use Dryscrape:

import bs4 as bs
import dryscrape

url = "https://www.google.com/finance?q=tsla"
session = dryscrape.Session()
session.visit(url)
html_source = session.body()
soup = bs.BeautifulSoup(html_source, 'lxml')
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())

All three approaches give the same output:

Valuation▲▼Company name▲▼Price▲▼Change▲▼Chg %▲▼d | m | y▲▼Mkt Cap▲▼TSLATesla Inc328.40-1.52-0.46%53.69BDDAIFDaimler AG72.94-1.50-2.01%76.29BFFord Motor Company11.53-0.17-1.45%45.25BGMGeneral Motors Co...36.07-0.34-0.93%53.93BRNSDFRENAULT SA EUR3.8197.000.000.00%28.69BHMCHonda Motor Co Lt...27.52-0.18-0.65%49.47BAUDVFAUDI AG NPV840.400.000.00%36.14BTMToyota Motor Corp...109.31-0.53-0.48%177.79BBAMXFBAYER MOTOREN WER...94.57-2.41-2.48%56.93BNSANYNissan Motor Co L...20.400.000.00%42.85BMMTOFMITSUBISHI MOTOR ...6.86+0.091.26%10.22B
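
The output above is one long string because get_text() on the table concatenates the text of every cell. If you want rows and columns, one option is to walk the tr and td elements yourself. Here is a minimal sketch. The sample HTML is a simplified, hypothetical stand-in for the rendered table (the real markup may differ, and the "Symbol" header is my own label), and it uses the built-in "html.parser" so it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the rendered table.
SAMPLE_HTML = """
<table id="cc-table" class="gf-table">
  <tr><th>Symbol</th><th>Company name</th><th>Price</th></tr>
  <tr><td>TSLA</td><td>Tesla Inc</td><td>328.40</td></tr>
  <tr><td>F</td><td>Ford Motor Company</td><td>11.53</td></tr>
</table>
"""


def table_rows(html, table_id):
    """Return the table with the given id as a list of rows of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": table_id})
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Collect header and data cells, stripping surrounding whitespace.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


for row in table_rows(SAMPLE_HTML, "cc-table"):
    print(" | ".join(row))
```

With the rendered page, you would pass html_source (or result) instead of SAMPLE_HTML.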

EDIT

QtWebKit was deprecated upstream in Qt 5.5 and removed in 5.6.

You can switch to PyQt5.QtWebEngineWidgets instead.

Most website owners don't like scrapers because they take data the company values, use up a whole bunch of their server time and bandwidth, and give nothing in return. Big companies like Google may have entire teams employing a whole host of methods to detect and block bots trying to scrape their data.

There are several ways around this:

  • Scrape from another, less heavily protected website.
  • See if Google or another company has an API for public use.
  • Use a more advanced scraper like Selenium (and probably still be blocked by Google).

You can scrape Google Finance with the BeautifulSoup web scraping library alone, without Selenium, because the data you want to extract is not rendered via JavaScript. It is also much faster than launching a whole browser.


from bs4 import BeautifulSoup
import requests, lxml, json

params = {
    "hl": "en",  # interface language
}

# Send a browser-like User-Agent with the request.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}

html = requests.get("https://www.google.com/finance?q=tsla", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

ticker_data = []

# Each .tOzDHb element holds one ticker entry.
for ticker in soup.select(".tOzDHb"):
    title = ticker.select_one(".RwFyvf").text
    price = ticker.select_one(".YMlKec").text
    index = ticker.select_one(".COaKTb").text
    price_change = ticker.select_one("[jsname=Fe7oBc]")["aria-label"]

    ticker_data.append({
        "index": index,
        "title": title,
        "price": price,
        "price_change": price_change,
    })

print(json.dumps(ticker_data, indent=2))

Example output

[
  {
    "index": "Index",
    "title": "Dow Jones Industrial Average",
    "price": "32,774.41",
    "price_change": "Down by 0.18%"
  },
  {
    "index": "Index",
    "title": "S&P 500",
    "price": "4,122.47",
    "price_change": "Down by 0.42%"
  },
  {
    "index": "TSLA",
    "title": "Tesla Inc",
    "price": "$850.00",
    "price_change": "Down by 2.44%"
  },
  # ...
]
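
Every value in that output is a string. If you need numbers, something like the sketch below could convert them. It assumes the formats shown above ("$850.00", "32,774.41", "Down by 0.18%"). The "Up by ..." form for gains is my assumption, since the sample only shows losses, and both helper names are made up for this example.

```python
import re


def parse_price(text):
    """Turn '$850.00' or '32,774.41' into a float; None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def parse_change(label):
    """Turn 'Down by 2.44%' into -2.44; 'Up by ...' is assumed for gains."""
    match = re.match(r"(Up|Down) by (\d+(?:\.\d+)?)%", label)
    if match is None:
        return None
    value = float(match.group(2))
    return -value if match.group(1) == "Down" else value


print(parse_price("$850.00"), parse_change("Down by 2.44%"))
```

Both helpers return None for text they do not recognise, so unexpected labels do not raise.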

If you need to scrape more data from Google Finance, there is a blog post on the topic, "Scrape Google Finance Ticker Quote Data in Python".
