
PHP Xpath Scrape Possible Namespace Issue

UPDATE: The page source is very different from what Developer Tools shows.

Check out the source: view-source:http://www.machinerytrader.com/list/list.aspx?ETID=1&catid=1002

Is that JavaScript that a browser needs to render into HTML? If so, how can I have PHP do that step so that I have HTML to parse? It's odd that XPath Checker returns the items I'm looking for (see below), yet the full HTML is not in the page source!

(Xpath: //table[contains(@id, 'ctl00_ContentPlaceHolder1') and (contains(@id,"tblContent") or contains(@id,"tblListingHeader"))])

END UPDATE

I need to scrape some information off of this site for work on a regular basis, and I am attempting to write some PHP code to do it. Having read a number of other posts on SO, I think I have a namespace issue here. I have never encountered namespace problems before, so I used the approach shown in another SO post, to no avail.

The XPath query appears to match nothing, for whatever reason. If you have any guesses or solutions as to how to handle this issue, I am open to suggestions.

Also here is the output from my code:

object(DOMXPath)#2 (0) {
}
Debug 1
array(0) {
}
array(0) {
}

The code follows. I left out the bottom part, where I var_dump testarray and then create and var_dump otherarray; their output is included above. Obviously both arrays will be empty if the XPath query returns no nodes.

$string = 'http://www.machinerytrader.com/list/list.aspx?ETID=1&catid=1002';

// Fetch the raw page source; no JavaScript is executed at this point.
$machine_trader = file_get_contents($string);
$xml = new DOMDocument();
$xml->loadHTML($machine_trader);

$xpath = new DOMXPath($xml);

// Namespace workaround taken from another SO post.
$rootNamespace = $xml->lookupNamespaceUri($xml->namespaceURI);
$xpath->registerNamespace('x', $rootNamespace);

$tableRows = $xpath->query("//x:table[contains(@id, 'ctl00_ContentPlaceHolder1') and (contains(@id,'tblContent') or contains(@id,'tblListingHeader'))]");

var_dump($xpath);

$testarray = array();
$otherarray = array();

// Collect the text content of every matched table.
foreach ($tableRows as $row) {
    echo "Debug 1" . "\n";
    $testarray[] = $row->nodeValue;
}

This is not an XPath issue: the actual content is returned by a form post, which your code never reaches. The JavaScript in the page source does nothing more than authenticate a proper 'user' for the information request and then send that request via form submission.

The salt / encryption 'key' is randomized on every request, which prevents simple scrapes.
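
One way to check this diagnosis is to run the question's XPath, minus the namespace prefix, against an HTML string. The following is a minimal sketch rather than a definitive implementation: the helper name findListingTableIds is made up for illustration, and the only thing taken from the real site is the table ids quoted in the question. Elements parsed by DOMDocument::loadHTML() are in no namespace, so no prefix needs to be registered.

```php
<?php
// Hypothetical helper: returns the id of every listing table found in $html.
function findListingTableIds(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects empty input
    }

    // Collect libxml warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Same predicate as in the question, without the x: prefix.
    $xpath = new DOMXPath($doc);
    $tables = $xpath->query(
        "//table[contains(@id, 'ctl00_ContentPlaceHolder1') and "
        . "(contains(@id, 'tblContent') or contains(@id, 'tblListingHeader'))]"
    );

    $ids = [];
    foreach ($tables as $table) {
        $ids[] = $table->getAttribute('id');
    }
    return $ids;
}
```

If findListingTableIds(file_get_contents($string)) returns an empty array while the same call on HTML saved from Developer Tools returns ids, the tables are not part of the initial response and no change to the XPath will find them.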

You could port that JavaScript to PHP and then issue two requests, working through the authentication process along the way.

Or, rather than reverse-engineering this, you could move your scraping to Node.js and use something like PhantomJS, which evaluates JavaScript while still giving you programmatic access. Given the complexity of this task, it would be much simpler to use the right tool.
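
The two-request approach could look roughly like the sketch below. Everything specific in it is an assumption: the function names extractFormFields and fetchListing are hypothetical, the form is assumed to post back to the same URL, and $transform is a placeholder for whatever the ported JavaScript does to the fields (that logic is not reproduced here). A single cURL handle is reused so that cookies set by the first response are sent with the second request.

```php
<?php
// Hypothetical helper: name => value for every named <input> in the first <form>.
function extractFormFields(string $html): array
{
    if ($html === '') {
        return [];
    }
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $form = $xpath->query('//form')->item(0);
    if ($form === null) {
        return [];
    }

    $fields = [];
    foreach ($xpath->query('.//input[@name]', $form) as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Hypothetical flow: GET the page, adjust the form fields, POST them back.
// $transform stands in for the JavaScript logic ported to PHP.
function fetchListing(string $url, callable $transform): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => '', // empty string turns on in-memory cookies
    ]);

    // Request 1: the page that carries the form and the per-request key.
    $first = curl_exec($ch);
    if ($first === false || $first === '') {
        return null;
    }

    $fields = $transform(extractFormFields($first));

    // Request 2: submit the form (assumed to post back to the same URL).
    curl_setopt_array($ch, [
        CURLOPT_POST       => true,
        CURLOPT_POSTFIELDS => http_build_query($fields),
    ]);
    $second = curl_exec($ch);

    return $second === false ? null : $second;
}
```

Because the key changes on every request, both requests have to happen in the same run; a POST body captured once from the browser cannot simply be replayed later.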


 