
Python Web-Scraping data that's not hard-coded into the HTML

I'm trying to scrape pricing data from insight.com. Here's an example page.

From that page, I'd like to pull the ListPrice. I've done this before with requests and BeautifulSoup, but on those occasions the price was directly in the HTML, so it was rather easy to pull out. However, Insight appears to be getting this price data from "webProduct.prices[0].price", which I assume is a JavaScript object.

Here's the exact HTML element:

 <p class="ips-price-contract">List price</p><p class="prod-price">{{- webProduct.prices[0].currency }}&nbsp;{{= numeral(webProduct.prices[0].price).format(InsightUtil.GetCurrencyFormat()) }}</p>

Is there a way I can still get this pricing data with Python?
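
For context, fetching the page with requests and parsing it with BeautifulSoup only returns the unrendered template shown above, since the price is filled in client-side. A minimal sketch of what that looks like (the URL is a placeholder for the example page linked above):

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL -- substitute the actual product page linked above.
    url = "https://www.insight.com/en_US/shop/product/EXAMPLE-SKU"
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(resp.text, "html.parser")

    # This prints the raw template, e.g. "{{- webProduct.prices[0].currency }} ...",
    # rather than an actual price, because the page's JavaScript never runs here.
    price_el = soup.find("p", class_="prod-price")
    print(price_el.get_text() if price_el else "prod-price element not found")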

EDIT: Solution Below

Thanks to Harun Ergül's solution below, I was able to get this working. First, I used the app Postman to get the POST working through there. Here's what the finished POST looks like: [screenshots: the POST request in Postman and the JSON payload in the body]

To translate the JSON payload to Python, I first formatted it as a Python dict (e.g. replacing 'null' with 'None', 'true' and 'false' with 'True' and 'False', etc.) and then made the request with data=json.dumps(data).
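
As a rough sketch of the final request (the endpoint URL, headers, and payload fields below are placeholders; the real values come from the request captured in Postman or the Chrome Network tab):

    import json
    import requests

    # Placeholder endpoint and payload -- copy the real URL, headers and JSON body
    # from the XHR request captured in Postman or the browser's Network tab.
    url = "https://www.insight.com/api/pricing"    # hypothetical endpoint
    payload = {
        "materialId": "EXAMPLE-SKU",   # hypothetical field names
        "contractId": None,            # 'null' in the captured JSON becomes None
        "includeListPrice": True,      # 'true' / 'false' become True / False
    }
    headers = {"Content-Type": "application/json"}

    resp = requests.post(url, data=json.dumps(payload), headers=headers)
    print(resp.json())                 # the price should be somewhere in this response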

This website makes an extra request for the price. You should imitate that same request; you can find it under the Chrome Network XHR tab.



Don't use a selenium-style solution, because it takes a long time to scrape a large set of data.

The best way to handle JavaScript-enabled pages is to use selenium with a browser (there are drivers for all real-world browsers like Chrome, Firefox, etc., and even for headless browsers like PhantomJS). This stack will fetch your page and run all JavaScript associated with it. You can then get the processed source and extract your data from there (since by then {{- webProduct.prices[0].currency }}&nbsp;{{= numeral(webProduct.prices[0].price).format(InsightUtil.GetCurrencyFormat()) }} would be replaced by the actual price):

    from selenium import webdriver

    driver = webdriver.Chrome()       # any driver works, e.g. Firefox()
    driver.get(page)                  # page is the product URL
    page_source = driver.page_source  # rendered HTML after the JavaScript has run
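
From there the rendered source can be parsed as usual; a minimal continuation of the snippet above (assuming the price ends up inside the prod-price element quoted in the question):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(page_source, "html.parser")
    price_el = soup.find("p", class_="prod-price")   # the element quoted in the question
    if price_el:
        print(price_el.get_text(strip=True))         # the rendered currency and price
    driver.quit()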

Alternatively, you can inspect the page in an actual browser, monitor its network activity, find out what API requests the page makes to get the necessary data, and replicate those with the requests library.
