
Getting OnLoad HTML/DOM for an HTML page in PHP

I am trying to get the HTML (i.e. what you see initially when the page completes loading) for some web page URI. Stripping out all error checking and assuming static HTML, it's a single line of code:

function GetDisplayedHTML($uri) {
   return file_get_contents($uri);
}

This works fine for static HTML, and is easy to extend with simple parsing if the page has static file dependencies/references. Tags like <script src="XXX">, <a href="XXX">, <img src="XXX">, and CSS references can also be detected and the dependencies returned in an array, if they matter.

But what about web pages where the HTML is dynamically created using events/AJAX? For example, suppose the HTML for the web page is just a brief AJAX-based or OnLoad script that builds the visible page. Then parsing alone won't work.

I guess what I need is a way, from within PHP, to open and render the HTTP response (i.e. the HTML we get at first) via some JavaScript engine or browser, and once it 'stabilises', capture the HTML (or static DOM?) that's now present, which will be what the user is actually seeing.

Since such a web page could continually change itself, I'd have to define "stable" (OnLoad or after X seconds?). I also don't need to capture any timer or async event states (i.e. "things set in motion that might cause web page updates at some future time"). I only need enough of the DOM to represent the static appearance the user could see at that time.

What would I need to do to achieve this programmatically in PHP?

To render a page with JS you need to use a browser. PhantomJS was created for tasks like this. Here is a simple script to run with PhantomJS:

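var webPage = require('webpage');
var page = webPage.create();
var system = require('system');
var args = system.args; // args[0] is the script name, args[1] the URL

if (args.length === 1) {
    console.log('First argument must be page URL!');
    phantom.exit(1); // exit explicitly, otherwise PhantomJS keeps running
} else {
    page.open(args[1], function (status) {
        if (status !== 'success') {
            console.log('Failed to load ' + args[1]);
            phantom.exit(1);
            return;
        }
        window.setTimeout(function () { // Wait for the page's scripts to run
            var content = page.content; // HTML of the page as it is now
            console.log(content);
            phantom.exit();
        }, 500);
    });
}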

It writes the resulting HTML to console output. You can run it from the console like this:

./phantomjs.exe render.js http://yandex.ru

Or you can use PHP to run it:

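<?php
$path = dirname(__FILE__);
// Quote each argument so that paths containing spaces still work
$command = escapeshellarg($path . DIRECTORY_SEPARATOR . 'phantomjs.exe')
    . ' ' . escapeshellarg($path . DIRECTORY_SEPARATOR . 'render.js')
    . ' ' . escapeshellarg('http://phantomjs.org/');
$html = shell_exec($command);

echo htmlspecialchars((string) $html);

My PHP code assumes that the PhantomJS executable and render.js are in the same directory as the PHP script.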
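
The fixed 500 ms wait in the script is only a guess at when the page has settled. If "stable" is taken to mean "two consecutive snapshots of the HTML are identical", the wait could be replaced by polling. The following is a minimal sketch under that assumption; `waitUntilStable` and its parameters are hypothetical names for illustration, not part of the PhantomJS API, and the code is plain ES5 so it should run in PhantomJS as well as in Node.

```javascript
// Poll a content getter until two consecutive reads are identical
// (the page has "stabilised") or until maxWaitMs has elapsed.
// getContent: function returning the current HTML as a string
// done(html, stable): called once with the last snapshot and a flag
function waitUntilStable(getContent, intervalMs, maxWaitMs, done) {
    var start = Date.now();
    var previous = getContent();

    function check() {
        var current = getContent();
        var elapsed = Date.now() - start;
        if (current === previous) {
            done(current, true);   // unchanged since the previous poll
        } else if (elapsed >= maxWaitMs) {
            done(current, false);  // gave up: the page is still changing
        } else {
            previous = current;
            setTimeout(check, intervalMs);
        }
    }

    setTimeout(check, intervalMs);
}
```

Inside the `page.open` callback this could stand in for the `window.setTimeout` call: pass `function () { return page.content; }` as `getContent`, and in `done` print the HTML and call `phantom.exit()`. A page that updates itself forever never becomes stable, which is why the sketch also takes an upper bound and reports whether it was hit.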
