Scraping specific text from a webpage using XPath

I've searched and tried multiple ways to get this, but I'm not sure why it won't find most of the information on the webpage.

Page to scrape: https://m.safeguardproperties.com/

Info needed: Version number for PhotoDirect for Apple (currently 4.4.0)

XPath to the text needed (I think): /html/body/div[1]/div[2]/div[1]/div[4]/div[3]/a

Attempts:

<?php

$file = "https://m.safeguardproperties.com/";
$doc = new DOMDocument();
$doc->loadHTMLFile($file);

$xpath = new DOMXPath($doc);

$elements = $xpath->query("/html/body/div[1]/div[2]/div[1]/div[4]/div[3]/a");

echo "<pre>";

// query() returns false for a malformed expression, never null
if ($elements !== false) {
    foreach ($elements as $element) {
        var_dump($element);
        echo "<br/>[" . $element->nodeName . "]";

        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->nodeValue . "\n";
        }
    }
}

echo "</pre>";

?>

Second attempt:

<?php
$file = "https://m.safeguardproperties.com/";
$doc = new DOMDocument();
$doc->loadHTMLFile($file);

echo '<pre>';

// trying to find all links in document to see if I can see the correct one
$links = [];
$arr = $doc->getElementsByTagName("a");

foreach ($arr as $item) {
    $href = $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
        'href' => $href,
        'text' => $text
    ];
}

var_dump($links);
echo '</pre>';
?>

For that particular website, the versions are loaded client-side from JSON data, so you won't find them in the base document.

http://m.safeguardproperties.com/js/photodirect.json

This was located by comparing the original document source to the finished DOM and inspecting the network activity in the developer console.
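
The same comparison can be approximated from PHP before writing any XPath. The following is a minimal sketch, not the method used above: `sourceContains` is a hypothetical helper, and `$rawHtml` is made-up markup standing in for the server response. If the value you see in the browser is absent from the raw source, it is being injected by a script and no XPath query against the downloaded document will find it.

```php
<?php
// Returns true when $needle occurs anywhere in the raw, unrendered HTML source.
function sourceContains(string $html, string $needle): bool
{
    return $needle !== '' && strpos($html, $needle) !== false;
}

// Hypothetical markup (an assumption, not the real page): the link exists,
// but its text stays empty until a script fills it in.
$rawHtml = '<div id="ios"><a href="#"></a></div><script src="js/app.js"></script>';

var_dump(sourceContains($rawHtml, '4.4.0')); // bool(false)
```

Against the live page, the first argument would be the result of file_get_contents() on the page URL.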

$url = 'https://m.safeguardproperties.com/js/photodirect.json';
$json = file_get_contents($url);   // fetch the raw JSON text
$object = json_decode($json);      // decode into a stdClass object
echo $object->ios->version;        // 4.4.0 at the time of writing
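
If the request fails or the JSON layout changes, the snippet above will emit warnings rather than a clean result. Here is a slightly more defensive sketch of the decoding step. `extractVersion` is a hypothetical helper name, and the `$sample` payload is an assumption: only the ios->version path is confirmed by the answer, so the rest of the real file may look different.

```php
<?php
// Extracts the version string for one platform from a JSON document.
// Returns null when the JSON is invalid or the expected keys are missing.
function extractVersion(string $json, string $platform): ?string
{
    $data = json_decode($json, true); // true: decode objects as associative arrays
    if (!is_array($data)) {
        return null; // invalid JSON, or the top level is not an object/array
    }
    $version = $data[$platform]['version'] ?? null;
    return is_string($version) ? $version : null;
}

// Hypothetical payload: only the ios->version path is known from the answer.
$sample = '{"ios": {"version": "4.4.0"}}';

var_dump(extractVersion($sample, 'ios'));     // string(5) "4.4.0"
var_dump(extractVersion($sample, 'android')); // NULL
var_dump(extractVersion('not json', 'ios'));  // NULL
```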

Please respect other websites and cache your GET requests.
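
One simple way to do that is a file-based cache with a time-to-live. This is a sketch under stated assumptions rather than a recommended implementation: `cachedGet`, the cache file location, and the one-hour lifetime are all illustrative choices, and the `$fetch` parameter exists only so the function can be tried without touching the network.

```php
<?php
// Returns the body for $url, reusing a local copy while it is younger than $ttl seconds.
// $fetch defaults to file_get_contents; pass another callable to substitute the request.
function cachedGet(string $url, string $cacheFile, int $ttl, ?callable $fetch = null): ?string
{
    clearstatcache(true, $cacheFile); // avoid stale file metadata within one process
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false) {
            return $cached; // fresh cache hit: no request is made
        }
    }

    $fetch = $fetch ?? 'file_get_contents';
    $body = $fetch($url);
    if ($body === false || $body === null) {
        return null; // request failed; any existing cache file is left untouched
    }

    file_put_contents($cacheFile, $body, LOCK_EX);
    return $body;
}

// Usage (hypothetical cache path and a one-hour lifetime):
// $cacheFile = sys_get_temp_dir() . '/photodirect.json';
// $json = cachedGet('https://m.safeguardproperties.com/js/photodirect.json', $cacheFile, 3600);
```

With this in place, repeated page loads within the hour read the local file instead of requesting the JSON again.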
