
Python Web-Scraping data that's not hard-coded into the HTML

I'm trying to scrape pricing data from insight.com. Here's an example page.

From that page, I'd like to pull the ListPrice. I've done this before with requests and BeautifulSoup, but on those occasions the price was directly in the HTML, so it was rather easy to pull out. However, Insight appears to be getting this price data from "webProduct.prices[0].price", which I assume is a JavaScript object.

Here's the exact HTML element:

 <p class="ips-price-contract">List price</p><p class="prod-price">{{- webProduct.prices[0].currency }}&nbsp;{{= numeral(webProduct.prices[0].price).format(InsightUtil.GetCurrencyFormat()) }}</p>

Is there a way I can still get this pricing data with Python?

EDIT: Solution Below

Thanks to Harun Ergül's solution below, I was able to get this working. First, I used the app Postman to get the POST request working there. Here's what the finished request looks like (screenshots: the POST request, and the JSON payload in the body).

To translate the JSON payload to Python, I first formatted it as a Python dict (e.g. replacing 'null' with 'None', and 'true' and 'false' with 'True' and 'False') and then made the request with data=json.dumps(data).
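The manual replacement of null, true and false can also be left to the standard json module, which does the same mapping when it parses the text. A minimal sketch, assuming the payload was copied out of Postman as raw text; the field names and values below are hypothetical placeholders, not Insight's real request schema:

```python
import json

# Hypothetical payload text, as it might be copied from Postman's body tab.
# The field names are placeholders, not Insight's real request schema.
RAW_PAYLOAD = """
{
  "materialId": "EXAMPLE-123",
  "contractId": null,
  "loadAccessories": false,
  "includePricing": true
}
"""

# json.loads maps null -> None and true/false -> True/False,
# so the payload does not need to be edited by hand.
payload = json.loads(RAW_PAYLOAD)

# Serialise back to a JSON string, as in data=json.dumps(data).
body = json.dumps(payload)
headers = {"Content-Type": "application/json"}

# With the requests library the call would then look like this, where
# PRICE_URL stands for whatever endpoint the browser's network tab shows:
#   response = requests.post(PRICE_URL, data=body, headers=headers)
#   prices = response.json()
```

Passing the dict as json=payload to requests.post is an equivalent shortcut: requests then serialises it and sets the Content-Type header itself.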

This website makes an extra request for the price. You should imitate the same request. You can find it in Chrome's developer tools, under the XHR filter of the Network tab.
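Imitating that request could look roughly like the sketch below. It assumes the endpoint answers a JSON POST and returns JSON shaped like the webProduct.prices[0] expression in the page template; the real URL, payload and response layout have to be read from the network tab, so the function names, the url and payload arguments, and the sample values here are all hypothetical.

```python
def extract_list_price(data):
    """Return (currency, price) from a parsed JSON response.

    Assumes the response mirrors the template expression
    webProduct.prices[0]; adjust the keys to the real response.
    """
    first_price = data["webProduct"]["prices"][0]
    return first_price["currency"], first_price["price"]


def fetch_list_price(url, payload, timeout=10):
    """POST the payload to the pricing endpoint and return (currency, price).

    url and payload are whatever the XHR entry in the network tab shows.
    """
    # Third-party dependency (pip install requests), imported here so the
    # parsing helper above can be used without it.
    import requests

    response = requests.post(url, json=payload, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return extract_list_price(response.json())


# Made-up sample response used only to exercise the parsing helper.
SAMPLE_RESPONSE = {"webProduct": {"prices": [{"currency": "USD", "price": 1234.56}]}}
currency, price = extract_list_price(SAMPLE_RESPONSE)
```

Some sites also check headers or cookies on such requests; if the plain POST is rejected, copying the request headers shown in the network tab is the usual next step.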



Don't use a Selenium-style solution, because it takes too long when scraping a large set of data.

The best way to handle JavaScript-driven pages is to use Selenium with a browser (there are drivers for all real-world browsers like Chrome and Firefox, and even for headless browsers like PhantomJS). This stack will fetch your page and run all the JavaScript associated with it. You can then get the processed source and extract your data from there (since by then {{- webProduct.prices[0].currency }}&nbsp;{{= numeral(webProduct.prices[0].price).format(InsightUtil.GetCurrencyFormat()) }} will have been replaced by the actual price).

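from selenium import webdriver
driver = webdriver.Chrome()  # or any other installed browser driver
driver.get(page)  # page is the product URL
page_source = driver.page_source  # HTML after the JavaScript has run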
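Once the rendered source is available, the price sits in the p element with class prod-price shown in the question. BeautifulSoup is the usual tool for that step; the sketch below swaps in the standard library's html.parser so it runs without extra dependencies. The sample HTML and its price value are made up for illustration, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ProdPriceParser(HTMLParser):
    """Collect the text of every <p class="prod-price"> element."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes &nbsp; for us
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "prod-price" in classes:
            self._in_price = True
            self.prices.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices[-1] += data


def extract_prices(page_source):
    """Return the rendered price strings found in page_source."""
    parser = ProdPriceParser()
    parser.feed(page_source)
    parser.close()
    # Replace the non-breaking space between currency and amount.
    return [text.replace("\xa0", " ").strip() for text in parser.prices]


# Made-up rendered markup; in practice pass driver.page_source instead.
SAMPLE_SOURCE = (
    '<p class="ips-price-contract">List price</p>'
    '<p class="prod-price">USD&nbsp;1,234.56</p>'
)
print(extract_prices(SAMPLE_SOURCE))
```

If the page fills in the price some time after it loads, the source may still contain the raw template when it is read, so waiting for the element to render before reading page_source may be necessary.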

Alternatively, you can inspect the page in an actual browser, monitor its network activity, find out which API requests the page makes to get the necessary data, and replicate those with the requests library.


 