PHP DOMXPath Get Value with Full Path - No ID

I am trying to get a value of an element through a direct XPath as the element has no ID.

$dom = new DOMDocument();
@$dom->loadHTML($rawHTML);
$finder = new DOMXPath($dom);

//this works well
$elements = $finder->query("//*[@id='html-ID-value']")->item(0);

//this does Not work
$testPath = '/html/body/div[2]/div[1]/div[7]/div[1]/div/div/table/tbody/tr[6]/td';

//tested several different ways to fetch the data
$elements = $finder->query("//*[@xpath='" . $testPath . "']");
$elements = $finder->query( $testPath );
$elements = $finder->evaluate( $testPath );

I am generating the test XPath in Firefox: I highlight an element in the inspector, right-click it, and choose Copy XPath.

When using an ID the code works well, but I am not able to fetch the data using the direct XPath.

The element I am seeking does not have any unique values to search by. I would like to use the direct XPath rather than iterating through a complex DOM object, as I need this code to operate on many different paths that will all be different.

Any help would be much appreciated.

Thanks.

========== EDIT / UPDATE =================================================

Thank you very much for the replies. I have added a fuller example of the problem. In this example I fetch Google's home page and read one value by ID and another by full XPath. The ID lookup works and the full XPath lookup fails.

I also tried the "evaluate" approach.

I am unable to reduce or simplify the full XPath data as this is just an example. The user will be generating this path if there is no ID to fetch by. So the path will be different every time based on what the user needs.

I agree that the path copied from the browser may not match the document as PHP parses it, and that this may be causing the problem. I do not know how I would remedy that.

<?php

error_reporting(E_ALL);
ini_set('display_errors', 1);

$ch = curl_init();

curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_REFERER, "http://www.google.com/bot.html");
curl_setopt($ch, CURLOPT_HEADER, 0);

curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.127 Safari/534.16" );

curl_setopt($ch, CURLOPT_URL, "https://www.google.com/" );
$result = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument();
@$dom->loadHTML($result);
$finder = new DOMXPath($dom);

// get "google offered in:" text by id ----------------------------------------------
$elements = $finder->query("//*[@id='SIvCob']")->item(0);

$results = '';

if ($elements) {
    $results = $elements->firstChild->textContent;
} else {
    $results = "";
}

print('google language: [' . $results . "] <br>"); //returns "Google offered in: " as expected

// get "Store" text by full xpath, top left corner of page -------------------------------------------
$xpath = "/html/body/div/div[3]/div[1]/a[2]"; //path generated by firefox inspector, right clicking on element

$elements = $finder->query($xpath)->item(0);

$results = '';

if ($elements) {
    $results = $elements->firstChild->textContent;
} else {
    $results = "";
}

print('google store: [' . $results . "] <br>");  //returns nothing
print_r($elements); //returns nothing

//trying again ----------------------------------------------------------------------------

$result = $finder->evaluate($xpath);
foreach ($result as $node) {
    var_dump($node); //returns nothing
}

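The parsed DOM in Firefox will not necessarily be the same as the original source, because Firefox repairs the document while parsing it. For example, it adds a tbody element to tables that do not have one. PHP's DOMDocument does not add it, so a path that contains tbody matches nothing.

So try it without the tbody step: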

$expression = '/html/body/div[2]/div[1]/div[7]/div[1]/div/div/table/tr[6]/td';
$result = $finder->evaluate($expression);
foreach ($result as $node) {
  var_dump($node);
}
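
If the paths come from users, the retry can be automated. The following is only a sketch under assumptions: queryBrowserXPath is a hypothetical helper name (not part of DOMXPath), and the sample HTML is made up for illustration. It runs the path as given and, if nothing matches, retries with the tbody steps removed.

```php
<?php
// Hypothetical helper: run a browser-copied XPath as-is and, if it matches
// nothing, retry with the "/tbody" steps removed.
function queryBrowserXPath(DOMXPath $finder, string $path): ?DOMNodeList
{
    $nodes = $finder->query($path);
    if ($nodes !== false && $nodes->length > 0) {
        return $nodes;
    }
    // Drop "/tbody" or "/tbody[n]" steps the browser may have inserted.
    $stripped = preg_replace('#/tbody(\[\d+\])?(?=/|$)#', '', $path);
    $nodes = $finder->query($stripped);
    return $nodes === false ? null : $nodes;
}

// Made-up sample: the source markup has no tbody element.
$html = '<html><body><table>'
      . '<tr><td>first</td></tr><tr><td>second</td></tr>'
      . '</table></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$finder = new DOMXPath($dom);

$browserPath = '/html/body/table/tbody/tr[2]/td'; // as a browser would report it
$direct  = $finder->query($browserPath);
$retried = queryBrowserXPath($finder, $browserPath);

echo $direct->length, "\n";                   // matches found by the raw path
echo $retried->item(0)->textContent, "\n";    // text found after the retry
```

Trying the original path first matters: if the source markup really does contain tbody, stripping it unconditionally would break a path that was already correct.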

However, I suggest using something else as the condition to make the expression less complex and less dependent on the exact nesting. For example, the class attribute of the div around the table:

//div[@class="aClass anotherClass"]/table/tr[6]/td

Or the contents of the first th inside the table:

//table[contains((tr/th)[1], "Column Header")]/tr[6]/td
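
As a rough illustration of both expressions, here is a sketch against a made-up three-row table (the markup and row indexes are assumptions chosen to keep the example short, not the asker's real page):

```php
<?php
// Made-up sample: a classed div wrapping a table with one header row.
$html = '<div class="aClass anotherClass"><table>'
      . '<tr><th>Column Header</th></tr>'
      . '<tr><td>first</td></tr>'
      . '<tr><td>second</td></tr>'
      . '</table></div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$finder = new DOMXPath($dom);

// Anchor on the class attribute of the surrounding div (exact string match).
$byClass = $finder
    ->query('//div[@class="aClass anotherClass"]/table/tr[3]/td')
    ->item(0);

// Anchor on the text of the first header cell inside the table.
$byHeader = $finder
    ->query('//table[contains((tr/th)[1], "Column Header")]/tr[2]/td')
    ->item(0);

echo $byClass->textContent, "\n";
echo $byHeader->textContent, "\n";
```

Note that @class="..." compares the whole attribute value, so it only matches when the classes appear in exactly that order and spacing.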

The problem may be that Google returns different markup to your grabber than to your browser. When I ran your demo code I got a completely different page (my location: Ukraine).

So first of all, save the grabbed HTML to a file: file_put_contents('google.html', $result); Then open this file in Firefox with JavaScript disabled, select the element you need in the Inspector, and copy its XPath from there.
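
Another way to see the document as PHP sees it is to locate the element by some other means and ask PHP for its path with DOMNode::getNodePath(). This is only a sketch: the sample HTML stands in for the downloaded page, and finding the link by its text is an assumption made for the demonstration.

```php
<?php
// Made-up stand-in for the downloaded page; in practice this would be the
// string returned by curl_exec().
$html = '<html><body><div>'
      . '<a href="/about">About</a><a href="/store">Store</a>'
      . '</div></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$finder = new DOMXPath($dom);

// Locate the element by its text, then ask PHP for its full path.
$node = $finder->query('//a[normalize-space(.)="Store"]')->item(0);
$path = $node->getNodePath();
echo $path, "\n"; // compare this with the path copied from the browser

// The reported path resolves back to the same node in this document.
$again = $finder->query($path)->item(0);
```

If the printed path differs from the one Firefox gives, the two parsers have built different trees, and the browser path has to be adjusted before it is passed to query().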

PS: If you want to build a robust grabber and parser, I recommend using Puppeteer (headless Chrome). A bridge for PHP is available here: https://github.com/nesk/puphpeteer
