Getting the OnLoad HTML/DOM for an HTML page in PHP

I am trying to get the HTML (i.e. what you see when the page finishes loading) for a given web page URI. Stripping out all error checking and assuming static HTML, it's a single line of code:

function GetDisplayedHTML($uri) {
   return file_get_contents($uri);
}

This works fine for static HTML, and is easy to extend with simple parsing if the page has static file dependencies or references. Tags like <script src="XXX">, <a href="XXX">, <img src="XXX"> and CSS references can then be detected and the dependencies returned in an array, if they matter.
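The simple parsing described above could be sketched with PHP's bundled DOM extension. This is only an illustration: the function name GetStaticDependencies and the shape of the returned array are assumptions, and relative URLs are returned exactly as written in the page rather than resolved against the page URI.

```php
<?php
// Hypothetical helper (not from the original post): list the static
// dependencies referenced by an HTML string, grouped by tag name.
function GetStaticDependencies(string $html): array
{
    // Tag name => attribute that carries the reference.
    $attrs = ['script' => 'src', 'a' => 'href', 'img' => 'src', 'link' => 'href'];
    $deps = ['script' => [], 'a' => [], 'img' => [], 'link' => []];

    if (trim($html) === '') {
        return $deps; // DOMDocument::loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($attrs as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $value = $node->getAttribute($attr); // '' when the attribute is absent
            if ($value !== '') {
                $deps[$tag][] = $value;
            }
        }
    }
    return $deps;
}
```

Calling GetStaticDependencies(file_get_contents($uri)) would then give the references found in the static HTML only; anything added later by scripts is not visible to this approach.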

But what about web pages where the HTML is created dynamically using events or AJAX? For example, suppose the HTML for the web page is just a brief AJAX-based or OnLoad script that builds the visible page. Then parsing alone won't work.

I guess what I need is a way, from within PHP, to open and render the HTTP response (i.e. the HTML we get at first) via some JavaScript engine or browser and, once it 'stabilises', capture the HTML (or static DOM?) that's now present, which will be what the user is actually seeing.

Since such a web page could keep changing itself, I'd have to define "stable" (OnLoad, or after X seconds?). I also don't need to capture any timer or async event state (i.e. "things set in motion that might cause web page updates at some future time"). I only need enough of the DOM to represent the static appearance the user could see at that time.

What would I need to do to achieve this programmatically in PHP?

To render a page with JavaScript you need to use a browser. PhantomJS, a headless browser, was created for tasks like this. Here is a simple script to run with PhantomJS:

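// render.js: print a page's HTML after its scripts have had time to run.
var webPage = require('webpage');
var system = require('system');
var page = webPage.create();
var args = system.args; // args[0] is the script name, args[1] the URL

if (args.length < 2) {
    console.log('First argument must be page URL!');
    phantom.exit(1); // without an explicit exit, PhantomJS keeps running
} else {
    page.open(args[1], function (status) {
        if (status !== 'success') {
            console.log('Failed to load ' + args[1]);
            phantom.exit(1);
            return;
        }
        window.setTimeout(function () { // wait for the page's scripts to run
            console.log(page.content);
            phantom.exit();
        }, 500);
    });
}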

It prints the resulting HTML to the console. You can run it from the command line like this:

./phantomjs.exe render.js http://yandex.ru

Or you can use PHP to run it:

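<?php
// Directory of this PHP script; phantomjs.exe and render.js are expected here.
$path = dirname(__FILE__);
// Pass render.js by full path so it is found whatever the working directory is,
// and quote the arguments so special characters cannot break the command.
$html = shell_exec($path . DIRECTORY_SEPARATOR . 'phantomjs.exe ' . escapeshellarg($path . DIRECTORY_SEPARATOR . 'render.js') . ' ' . escapeshellarg('http://phantomjs.org/'));

// Escape the markup so the browser shows it as text instead of rendering it.
echo htmlspecialchars($html);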

My PHP code assumes that the PhantomJS executable is in the same directory as the PHP script.
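To tie this back to the GetDisplayedHTML($uri) function from the question, here is a minimal sketch of a wrapper that accepts an arbitrary URI. The helper name buildRenderCommand, the restriction to http(s) URLs and the exception types are assumptions made for illustration; the file layout (phantomjs.exe and render.js next to the PHP script) is the one assumed above.

```php
<?php
// Hypothetical helper: build the PhantomJS command line for one URL.
function buildRenderCommand(string $phantomBin, string $script, string $uri): string
{
    // Only pass well-formed http(s) URLs on to the shell.
    $scheme = strtolower((string) parse_url($uri, PHP_URL_SCHEME));
    if (filter_var($uri, FILTER_VALIDATE_URL) === false
        || !in_array($scheme, ['http', 'https'], true)) {
        throw new InvalidArgumentException('Not an http(s) URL: ' . $uri);
    }
    // escapeshellarg() quotes each part so it reaches PhantomJS as one argument.
    return escapeshellarg($phantomBin) . ' '
        . escapeshellarg($script) . ' '
        . escapeshellarg($uri);
}

// Returns the HTML that render.js prints for the given URI.
function GetDisplayedHTML(string $uri): string
{
    $dir = __DIR__; // assumed location of phantomjs.exe and render.js
    $cmd = buildRenderCommand(
        $dir . DIRECTORY_SEPARATOR . 'phantomjs.exe',
        $dir . DIRECTORY_SEPARATOR . 'render.js',
        $uri
    );
    $html = shell_exec($cmd);
    if (!is_string($html) || $html === '') {
        throw new RuntimeException('PhantomJS produced no output for ' . $uri);
    }
    return $html;
}
```

Validating and quoting the URI matters as soon as it comes from user input, because shell_exec() hands the whole string to a shell. How a quoted executable path is handled can differ between platforms, so treat the command building as a starting point to adapt rather than a finished implementation.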
