
How to scrape specific data from a page with Simple HTML DOM Parser

I am trying to scrape the price of a product from an Amazon web page, but the variable ends up holding more than just the price: it also includes other elements such as <span> etc. The code:

include 'simple_html_dom.php';
$html1 = file_get_html('http://www.amazon.co.uk/New-Apple-iPod-touch-Generation/dp/B0040GIZTI/ref=br_lf_m_1000333483_1_1_img?ie=UTF8&s=electronics&pf_rd_p=229345967&pf_rd_s=center-3&pf_rd_t=1401&pf_rd_i=1000333483&pf_rd_m=A3P5ROKL5A1OLE&pf_rd_r=1ZW9HJW2KN2C2MTRJH60');

$price_data1 = $html1->find('b[class=priceLarge]',0);

The variable then also contains the surrounding markup, for example <b class="priceLarge">£163.00</b>

Is there a way to trim the unwanted data out? I just need £163.00.

I am unsure whether I should do this during the find, or specify what I want when I echo out the variable.

Cheers

Just use:

$result=$price_data1->innertext;

This returns only the content inside the <b> element, which is the desired output.
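
For comparison, when only the element's outer HTML is at hand as a plain string, the tags can also be removed without the parser. This is a minimal sketch using PHP's built-in strip_tags(); the sample string is copied from the question, and $outer is a hypothetical variable name.

```php
<?php
// Sample outer HTML copied from the question; in real use this would be
// the string form of the element returned by find().
$outer = '<b class="priceLarge">£163.00</b>';

// strip_tags() drops the markup; trim() removes stray whitespace.
$price = trim(strip_tags($outer));

echo $price, PHP_EOL;
```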

Change the XPath to select the text() child of the <b> element, rather than selecting the element itself.

$price_data1 = $html1->find('b[class=priceLarge]/text()',0);
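
As far as I know, Simple HTML DOM's find() takes CSS-style selectors, so a text() step may not be understood there. The sketch below shows the same idea with real XPath, using PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of Simple HTML DOM. The sample markup is an assumption standing in for the downloaded page; it writes the pound sign as &pound; so the example does not depend on the input encoding.

```php
<?php
// Assumed sample markup standing in for the downloaded product page.
$markup = '<html><body><div><b class="priceLarge">&pound;163.00</b></div></body></html>';

// Collect parser warnings internally instead of printing them;
// real-world HTML is rarely well-formed.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select the text node inside the <b>, not the element itself.
$nodes = $xpath->query('//b[@class="priceLarge"]/text()');

// nodeValue is returned as UTF-8; null signals that nothing matched.
$price = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;

echo $price, PHP_EOL;
```

If the real page is UTF-8 and contains a literal £, the input encoding may need to be declared before loading, otherwise the character can come out garbled.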

You can try an online API such as Synthetics Web, which lets you extract the data with minimal coding effort.

$url = urlencode('http://www.amazon.co.uk/New-Apple-iPod-touch-Generation/dp/B0040GIZTI/ref=br_lf_m_1000333483_1_1_img?ie=UTF8&s=electronics&pf_rd_p=229345967&pf_rd_s=center-3&pf_rd_t=1401&pf_rd_i=1000333483&pf_rd_m=A3P5ROKL5A1OLE&pf_rd_r=1ZW9HJW2KN2C2MTRJH60');
$wid = '160';

$data = json_decode(file_get_contents("http://www.syntheticsweb.com/resources/www.json?wid=$wid&url=$url"));

echo $data->price;

Given markup such as <b class="priceLarge">£163.00</b>, simply use the following:

// $html holds the page markup as a string; the price ends up in $match[1].
$p = '/<b class="priceLarge">(.*?)<\/b>/';
preg_match($p, $html, $match);
echo $match[1];
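
A self-contained sketch of the regex approach, run against sample markup assumed from the question. It uses a non-greedy group so the capture stops at the first closing </b>, and it checks the return value of preg_match() before reading the capture. $pattern and $price are illustrative names.

```php
<?php
// Sample markup based on the question; a real page would be much longer.
$html = '<div><b class="priceLarge">£163.00</b> <b>other</b></div>';

// (.*?) is non-greedy, so the match ends at the first closing </b>.
$pattern = '/<b class="priceLarge">(.*?)<\/b>/s';

if (preg_match($pattern, $html, $match) === 1) {
    $price = trim($match[1]);
} else {
    $price = null; // element missing or markup changed
}

echo $price, PHP_EOL;
```

A regex like this breaks as soon as the attribute order or quoting changes, so it is best kept for quick one-off extractions.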
