I'm trying to automate data extraction from the ASX ( https://www.asxenergy.com.au/futures_nz ) website into my database by writing a web-scraping Python script and deploying it in Azure Databricks. The script currently works in Visual Studio Code, but when I run it in Databricks it crashes with the error below.
Could not get version for google-chrome with the command: google-chrome --version || google-chrome-stable --version || google-chrome-beta --version || google-chrome-dev --version
I believe I will need to simplify my code so that it obtains the table without relying on a web browser.
My sample code is below:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
#browser = webdriver.Chrome('C:/chromedriver', options=options)  # Optional: point at a local chromedriver instead.
browser.get('https://www.asxenergy.com.au/futures_nz')
time.sleep(3)  # wait for the JavaScript-rendered table to load
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
market_dataset = soup.find_all(attrs={'class': 'market-dataset'})
market_dataset
I tried the code below instead, using just the requests package, but it failed because it couldn't find the 'market-dataset' div class.
import requests
from bs4 import BeautifulSoup

URL = "https://www.asxenergy.com.au/futures_nz"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
market_dataset = soup.findAll("div", href=True, attrs={'class':'market-dataset'})
Can anyone please help me?
This page uses JavaScript to load the table from https://www.asxenergy.com.au/futures_nz/dataset. The server checks whether the request is an AJAX/XHR request, so it needs the header 'X-Requested-With': 'XMLHttpRequest'.
Also, your findAll("div", href=True, ...) tries to find <div href="..."> elements, which this page doesn't have, so I search for normal <div> elements with class="market-dataset".
Minimal working code:
import requests
from bs4 import BeautifulSoup

headers = {
    # 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
    'X-Requested-With': 'XMLHttpRequest',  # required: the server only serves the table to XHR requests
}

URL = "https://www.asxenergy.com.au/futures_nz/dataset"
response = requests.get(URL, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
market_dataset = soup.find_all("div", attrs={'class': 'market-dataset'})
print('len(market_dataset):', len(market_dataset))
Result:
len(market_dataset): 10
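To load the scraped data into a database, you'll still need to turn each div's HTML into tabular rows. The real markup inside a 'market-dataset' div isn't shown here, so the snippet below is only a sketch that assumes each div wraps an ordinary HTML table; the sample markup and column names are invented for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup -- the real structure of a 'market-dataset' div may differ.
sample_html = """
<div class="market-dataset">
  <table>
    <tr><th>Contract</th><th>Last</th></tr>
    <tr><td>NZ Base Q1</td><td>150.5</td></tr>
    <tr><td>NZ Base Q2</td><td>148.0</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
rows = []
for div in soup.find_all("div", attrs={'class': 'market-dataset'}):
    for tr in div.find_all("tr"):
        # Collect both header (<th>) and data (<td>) cells as plain text.
        rows.append([c.get_text(strip=True) for c in tr.find_all(["th", "td"])])

# First row is the header, the rest are data rows.
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.shape)  # (2, 2)
```

From a DataFrame, pushing into a database is then a one-liner with df.to_sql and a SQLAlchemy connection.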
This might be helpful for you: Building a JavaScript Table Web Scraper Using Python without Headless Browsers ( https://www.scraperapi.com/blog/scrape-javascript-tables-python/ ), originally published on ScraperAPI.
Web tables are some of the greatest sources of data on the web. They already have an easy-to-read and understand format and are used to display large amounts of useful information like employee data, statistics, original research models, and more.
That said, not all tables are made the same and some can be really tricky to scrape using conventional techniques.
In this tutorial, we'll understand the difference between HTML and JavaScript tables, why the latter are harder to scrape, and we'll create a script to circumvent the challenges of rendering tables without using any highly complex technologies.
Table of Contents: (see link above for full article)
What Are JavaScript Tables?
HTML Tables vs. JavaScript Tables in Web Scraping
Scraping Dynamic Tables in Python with Requests
Finding the Hidden API to Access the JSON Data
Sending Our Initial HTTP Request
Reading and Scraping the JSON Data
Exporting Our Data to a CSV File
Running Our Script [Full Code]
Wrapping Up: Scale Your Scraper with ScraperAPI
Happy scraping!