PHP - How to fetch src of img tag with specific class name using preg_match_all?
I am trying to build a scraper for an Amazon product search results page.
Method:
function getHTMLcode($url) {
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 10.10; labnol;) ctrlq.org");
curl_setopt($curl, CURLOPT_ENCODING, 'identity');
curl_setopt($curl, CURLOPT_FAILONERROR, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($curl);
curl_close($curl);
return $html;
}
Method call:
$url="http://www.amazon.com/s/?url=search-alias%3Daps&field-keywords=iphone";
$html= getHTMLcode($url);
$image = '/src="(?P<img>[^"]*)"/';
preg_match_all($image,$html,$data);
var_dump($data);
Problem: This returns every src attribute on the page. I need only the product images, which have
class = "s-image"
and it does not return the h2 (product title) or the price.
Question: How do I fetch only the image, title, and price elements that have a specific class name from the Amazon product search list? Amazon returns:
<img src="https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL436_.jpg" class="s-image" alt="Apple iPhone Xs Max with FaceTime - 256GB, 4G LTE, Space Gray" srcset="https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL436_.jpg 1x, https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL654_FMwebp_QL65_.jpg 1.5x, https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL872_FMwebp_QL65_.jpg 2x, https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL1090_FMwebp_QL65_.jpg 2.5x, https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL1308_FMwebp_QL65_.jpg 3x" data-image-index="0" data-image-load="" data-image-latency="s-product-image" data-image-source-density="1">
Similarly, to get the title and price of a product I am trying:
$title = '/<h2 class="a-size-mini a-spacing-none a-color-base s-line-clamp-2">(?P<val>[^>]*)<\/h2>/';
preg_match_all($title,$html,$value);
var_dump($value);
$price ='/<span class="a-price-whole"><span class="a-price-symbol"> <\/span>(?P<price>[^>]*)<\/span>/';
preg_match_all($price,$html,$cost);
var_dump($cost);
You are using the wrong tools. Regular expressions cannot reliably match nested HTML. Use an HTML parser instead, and XPath queries to find what you're looking for:
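<?php
$url = "http://www.amazon.com/s/?url=search-alias%3Daps&field-keywords=iphone";
$html = getHTMLcode($url); // helper defined in the question

$dom = new DOMDocument();
// Pass true so libxml collects parse warnings instead of emitting them
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Select the src attribute of every img whose class contains "s-image"
$nodes = $xpath->query("//img[contains(@class, 's-image')]/@src");
$data = [];
foreach ($nodes as $node) {
    $data[] = $node->textContent;
}
print_r($data);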
Output:
Array
(
[0] => https://m.media-amazon.com/images/I/418H4DiygbL._AC_UL436_.jpg
[1] => https://m.media-amazon.com/images/I/61IzJCh8i8L._AC_UL436_.jpg
[2] => https://m.media-amazon.com/images/I/71RYhD1uzpL._AC_UL436_.jpg
[3] => https://m.media-amazon.com/images/I/41jUosGQiDL._AC_UL436_.jpg
[4] => https://m.media-amazon.com/images/I/51CBPR-l2VL._AC_UL436_.jpg
[5] => https://m.media-amazon.com/images/I/813nLXVhnwL._AC_UL436_.jpg
[6] => https://m.media-amazon.com/images/I/61WpoMEdpoL._AC_UL436_.jpg
[7] => https://m.media-amazon.com/images/I/913VoEdo-4L._AC_UL436_.jpg
[8] => https://m.media-amazon.com/images/I/81s7ZLOGOWL._AC_UL436_.jpg
[9] => https://m.media-amazon.com/images/I/81s7ZLOGOWL._AC_UL436_.jpg
[10] => https://m.media-amazon.com/images/I/513R4aVg1cL._AC_UL436_.jpg
[11] => https://m.media-amazon.com/images/I/51BbI-8wpTL._AC_UL436_.jpg
[12] => https://m.media-amazon.com/images/I/61pRPj+-IYL._AC_UL436_.jpg
[13] => https://m.media-amazon.com/images/I/71x3e0x+M2L._AC_UL436_.jpg
[14] => https://m.media-amazon.com/images/I/6165FLUs1+L._AC_UL436_.jpg
[15] => https://m.media-amazon.com/images/I/81ZJNQZBFCL._AC_UL436_.jpg
[16] => https://m.media-amazon.com/images/I/51sTR66B1UL._AC_UL436_.jpg
[17] => https://m.media-amazon.com/images/I/71QxMMTKiVL._AC_UL436_.jpg
[18] => https://m.media-amazon.com/images/I/61OUrdtiDcL._AC_UL436_.jpg
[19] => https://m.media-amazon.com/images/I/71ktNlpWWdL._AC_UL436_.jpg
[20] => https://m.media-amazon.com/images/I/51x3FM83EQL._AC_UL436_.jpg
[21] => https://m.media-amazon.com/images/I/41-Mv2nSrNL._AC_UL436_.jpg
)
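The same approach can probably be extended to the title and price the question asks about. The following is only a sketch under stated assumptions: the inline HTML is a hypothetical, simplified stand-in for Amazon's markup, the `s-result-item` wrapper class and the helper names `hasClass` and `extractProducts` are invented for illustration, and the `s-image`, `h2`, and `a-price-whole` selectors are taken from the question and may no longer match the live page. It also matches the class as a whole token, because `contains(@class, 's-image')` would also match a class such as `s-image-badge`.

```php
<?php
// Hypothetical sample markup; real Amazon pages differ and change often.
$html = <<<'HTML'
<html><body>
<div class="s-result-item">
  <img src="https://example.com/a.jpg" class="s-image" alt="Phone A">
  <h2 class="a-size-mini"><a href="/dp/A"><span>Phone A 64GB</span></a></h2>
  <span class="a-price"><span class="a-price-symbol">$</span><span class="a-price-whole">199</span></span>
</div>
<div class="s-result-item">
  <img src="https://example.com/b.jpg" class="s-image-badge" alt="Badge">
  <img src="https://example.com/b2.jpg" class="s-image" alt="Phone B">
  <h2 class="a-size-mini"><a href="/dp/B"><span>Phone B 128GB</span></a></h2>
</div>
</body></html>
HTML;

// Build an XPath predicate that matches $class as a whole class token.
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

// Return one array per result block with its image URL, title, and price.
function extractProducts(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $products = [];
    // Query relative to each result block so the fields stay grouped per product.
    foreach ($xpath->query('//div[' . hasClass('s-result-item') . ']') as $item) {
        $img = $xpath->query('.//img[' . hasClass('s-image') . ']/@src', $item)->item(0);
        $price = $xpath->query('.//span[' . hasClass('a-price-whole') . ']', $item)->item(0);
        $products[] = [
            'image' => $img ? $img->nodeValue : null,
            // string() flattens the nested <a>/<span> inside the h2 to plain text
            'title' => trim($xpath->evaluate('string(.//h2)', $item)),
            'price' => $price ? trim($price->textContent) : null,
        ];
    }
    return $products;
}

print_r(extractProducts($html));
```

Querying relative to each result block keeps image, title, and price aligned even when a product has no price, as in the second sample block, where `price` comes back as `null`. This is also why the regex in the question finds no titles: the `h2` contains nested tags, which `[^>]*` cannot span.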