
How to extract a table from a website without specifying the web browser in Python

I'm trying to automate data extraction from the ASX ( https://www.asxenergy.com.au/futures_nz ) website into my database by writing a web-scraping Python script and deploying it in Azure Databricks. The script currently works in Visual Studio Code, but when I try to run it in Databricks, it crashes with the error below.

Could not get version for google-chrome with the command: google-chrome --version || google-chrome-stable --version || google-chrome-beta --version || google-chrome-dev --version

I believe I need to simplify my code so that it obtains the table without referencing the web browser.

My sample code is below:

import time
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
import pandas as pd
import sys
from datetime import datetime
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
#browser = webdriver.Chrome('C:/chromedriver', options=options)  # Optional argument; if not specified, searches PATH.
browser.get('https://www.asxenergy.com.au/futures_nz')
time.sleep(3)
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
market_dataset = soup.find_all(attrs={'class': 'market-dataset'})
market_dataset

I tried the code below instead, using just the requests package, but it failed because it couldn't find the 'market-dataset' div class.

import requests
from bs4 import BeautifulSoup


URL = "https://www.asxenergy.com.au/futures_nz"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
market_dataset = soup.findAll("div", href=True, attrs={'class': 'market-dataset'})

Can anyone please help me?

This page uses JavaScript to load the table from https://www.asxenergy.com.au/futures_nz/dataset

The server checks whether the request is an AJAX/XHR request, so it needs the header

 'X-Requested-With': 'XMLHttpRequest' 

But your findAll("div", href=True, ...) tries to find <div href="..."> elements, which this page doesn't have - so I search instead for a normal <div> with class="market-dataset".


Minimal working code:

import requests
from bs4 import BeautifulSoup

headers = {
#    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0', 
    'X-Requested-With': 'XMLHttpRequest'     
}

URL = "https://www.asxenergy.com.au/futures_nz/dataset"
response = requests.get(URL, headers=headers)

soup = BeautifulSoup(response.content, "html.parser")
market_dataset = soup.findAll("div", attrs={'class':'market-dataset'})
print('len(market_dataset):', len(market_dataset))

Result:

len(market_dataset): 10
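Once the datasets come back, each market-dataset div can be flattened into a pandas DataFrame for loading into a database. A minimal sketch of that step; the hard-coded snippet below stands in for response.content, and the column names and values are invented for illustration - the live markup may differ:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.content; the real ASX markup may differ.
sample_html = """
<div class="market-dataset">
  <table>
    <tr><th>Expiry</th><th>Last</th></tr>
    <tr><td>Mar 2023</td><td>150.25</td></tr>
    <tr><td>Jun 2023</td><td>148.10</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
frames = []
for dataset in soup.find_all("div", attrs={"class": "market-dataset"}):
    table = dataset.find("table")
    # Collect the text of every cell, row by row
    rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            for tr in table.find_all("tr")]
    # Treat the first row as the column headers
    frames.append(pd.DataFrame(rows[1:], columns=rows[0]))

print(frames[0])
```

Each resulting DataFrame can then be written out with to_sql or to_csv, depending on the target database.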

This might be helpful for you: Building a JavaScript Table Web Scraper Using Python without Headless Browsers ( https://www.scraperapi.com/blog/scrape-javascript-tables-python/ )

Originally published on:

Building a JavaScript Table Web Scraper Using Python without Headless Browsers - ScraperAPI ( https://www.scraperapi.com/blog/scrape-javascript-tables-python/ )

Web tables are some of the greatest sources of data on the web. They already have an easy-to-read and understand format and are used to display large amounts of useful information like employee data, statistics, original research models, and more.

That said, not all tables are made the same and some can be really tricky to scrape using conventional techniques.

In this tutorial, we'll look at the difference between HTML and JavaScript tables and why the latter are harder to scrape, and we'll create a script that circumvents the challenges of rendering tables without using any highly complex technologies.

Table of Contents: (see link above for full article)

What Are JavaScript Tables?

HTML Tables vs. JavaScript Tables in Web Scraping

Scraping Dynamic Tables in Python with Requests

  1. Finding the Hidden API to Access the JSON Data

  2. Sending Our Initial HTTP Request

  3. Reading and Scraping the JSON Data

  4. Exporting Our Data to a CSV File

  5. Running Our Script [Full Code]

Wrapping Up: Scale Your Scraper with ScraperAPI
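The steps in the outline above can be sketched end to end. Everything here is illustrative: the JSON field names and record layout are assumptions standing in for a real hidden-API response (which, per steps 1-2, you would fetch with requests.get and the X-Requested-With header), not the article's actual example:

```python
import csv
import json

# Steps 1-2 would be: requests.get(api_url, headers={"X-Requested-With": "XMLHttpRequest"})
# Here a hypothetical JSON payload stands in for that API response.
payload = json.loads("""
{
  "rows": [
    {"contract": "BOM2023", "price": 150.25, "volume": 12},
    {"contract": "BOM2024", "price": 148.10, "volume": 7}
  ]
}
""")

# Step 3: read the records out of the JSON structure
records = payload["rows"]

# Step 4: export to a CSV file
with open("futures.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["contract", "price", "volume"])
    writer.writeheader()
    writer.writerows(records)

print(f"wrote {len(records)} rows")
```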

Happy scraping!
