
Separate HTML, CSS, and JavaScript from file with DomDocument

I'm loading a remote file with PHP and then trying to parse it with DomDocument. The file contains HTML, CSS (inside a style tag), and JavaScript (inside a script tag). I extract each part separately by passing html, css, or js to the function that parses it. The idea is that I can use core WordPress methods to display these in the proper locations.

This is the closest I've managed to get:

libxml_use_internal_errors( true );
$document = wp_remote_retrieve_body( $response ); // this is the remote HTML file
// create a new DomDocument object
$html = new DOMDocument( '1.0', 'UTF-8' );
// load the HTML into the DomDocument object (this would be your source HTML)
$html->loadHTML( $document, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
if ( 'html' === $part ) {
    $xpath  = new DOMXPath( $html );
    $remove = $xpath->query( "//*[style or script]" );
    foreach ( $remove as $node ) {
        $node->parentNode->removeChild($node);
    }
} elseif ( 'css' === $part ) {
    $xpath  = new DOMXPath( $html );
    $remove = $xpath->query( "//*[not(self::style)]" );
    foreach ( $remove as $node ) {
        $node->parentNode->removeChild($node);
    }
} elseif ( 'js' === $part ) {
    $xpath  = new DOMXPath( $html );
    $remove = $xpath->query( "//*[not(self::script)]" );
    foreach ( $remove as $node ) {
        $node->parentNode->removeChild($node);
    }
}

ob_start();
echo $html->saveHTML();
$output = ob_get_contents();
ob_end_clean();

This results in a few problems:

  1. On the CSS and JavaScript output, it keeps the style or script tag, and I'm trying to figure out how to remove it.
  2. On the HTML output, it keeps the <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"><html><head></head><body> and I'd like to remove that as well.

I'm not sure if I need to take this in another direction, or if I just need a small thing to remove these wrapping elements. But I had a lot of trouble getting xpath to relate to the elements I want to keep, rather than the ones I want to remove, and that's how I've ended up where I am.

For your html case, instead of saving the whole DOMDocument, you can save just the <body> element.

libxml_use_internal_errors( true );
$document = wp_remote_retrieve_body( $response ); // this is the remote HTML file
// create a new DomDocument object
$html = new DOMDocument( '1.0', 'UTF-8' );
// load the HTML into the DomDocument object (this would be your source HTML)
$html->loadHTML( $document, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
if ( 'html' === $part ) {
    // get all <body> elements
    $body_elements = $html->getElementsByTagName( 'body' );
    // it is assumed that there is only one <body> element
    $body = $body_elements->item( 0 );
    // serialize only that element; the result still includes the <body> tags,
    // and item( 0 ) is null when the source has no <body>
    $output = $body ? $html->saveHTML( $body ) : '';
} else {
    // ...
}
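
If the goal is the markup inside <body> without the wrapping tags, one option is to serialize each child node of the body rather than the element itself. The following is a minimal sketch, not a definitive implementation: the helper name wpse_inner_html and the sample markup are assumptions made for illustration, and the sample string stands in for the remote file.

```php
<?php
/**
 * Hypothetical helper: serialize the children of a node,
 * leaving out the node's own opening and closing tags.
 */
function wpse_inner_html( DOMNode $node ): string {
    $inner = '';
    foreach ( $node->childNodes as $child ) {
        // saveHTML( $child ) serializes one node instead of the whole document
        $inner .= $node->ownerDocument->saveHTML( $child );
    }
    return $inner;
}

libxml_use_internal_errors( true );
$doc = new DOMDocument();
// sample markup (assumption) in place of the remote file
$doc->loadHTML( '<html><head><style>p{color:red}</style></head><body><p>Hello</p><div>World</div></body></html>' );

$body   = $doc->getElementsByTagName( 'body' )->item( 0 );
$output = $body ? wpse_inner_html( $body ) : '';
```

Because only the body's children are serialized, the doctype, <html>, <head>, and <body> wrappers never reach $output.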

For the CSS and JS elements, I'm not sure why you'd need just their inner contents without the containing tag, but an approach similar to the one used for $body would work: select the elements, loop over them with foreach, read each element's contents (the element's textContent property returns the text it contains as a string), and concatenate those strings to build your eventual $output variable.
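
The select, loop, and concatenate steps could be sketched as follows. This is an illustration under stated assumptions: wpse_collect_text is a hypothetical helper name, the sample markup stands in for the remote file, and joining the pieces with a newline is an arbitrary choice.

```php
<?php
/**
 * Hypothetical helper: concatenate the text inside every <$tag> element.
 */
function wpse_collect_text( DOMDocument $doc, string $tag ): string {
    $parts = array();
    foreach ( $doc->getElementsByTagName( $tag ) as $element ) {
        // textContent is the raw text inside the element, without the tag itself
        $parts[] = trim( $element->textContent );
    }
    return implode( "\n", $parts );
}

libxml_use_internal_errors( true );
$doc = new DOMDocument();
// sample markup (assumption) in place of the remote file
$doc->loadHTML( '<html><head><style>p { color: red; }</style></head><body><p>Hi</p><script>console.log("hi");</script></body></html>' );

$css = wpse_collect_text( $doc, 'style' );
$js  = wpse_collect_text( $doc, 'script' );
```

Strings without their wrapping tags are the shape that core functions such as wp_add_inline_style() and wp_add_inline_script() expect as their data argument, which may be why the tag-free form is wanted here.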

An alternate approach for CSS and JS: take your existing approach's cluster of <script> or <style> elements, insert them into a blank DOMDocument's <head>, save that <head> as an HTML string, and then output that string via an anonymous function on WordPress' wp_enqueue_scripts hook:

/**
 * https://stackoverflow.com/questions/66361476/separate-html-css-and-javascript-from-file-with-domdocument?newreg=231eb52469c14d8c9c45ee9969df031a
 */
function wpse_66361476_alert() {
    $output = "<script>alert('hello');</script>"; // demonstration content
    add_action(
        'wp_enqueue_scripts',
        function() use ($output) {
            echo $output;
        }
    );
}
add_action('init', 'wpse_66361476_alert');

That approach is dangerous if you don't control the CSS and JS (and HTML) that you're outputting. It may be better to iframe in whatever you're loading here.

To improve page load speed if your host is not already using a frontend cache, you may want to look into caching the parsed elements using WordPress' caching functions. Talk to your hosting provider to see whether they have specific advice.
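
As one possible shape for that caching, here is a sketch built on the Transients API (get_transient() and set_transient()). The answer does not say which caching functions it has in mind, so the choice of transients is an assumption, as are the helper name wpse_get_cached_part, the key prefix, and the one-hour lifetime. The function only runs inside WordPress, where the transient functions are defined.

```php
<?php
/**
 * Hypothetical helper: return a parsed part from the transient cache,
 * computing and storing it on a cache miss.
 *
 * $compute is any callable that fetches and parses the remote file
 * and returns the requested part as a string.
 */
function wpse_get_cached_part( string $url, string $part, callable $compute ): string {
    // md5 keeps the key short regardless of the URL length
    $key    = 'wpse_part_' . md5( $url . '|' . $part );
    $cached = get_transient( $key );
    if ( false !== $cached ) {
        return $cached; // cache hit: skip the remote request and the parsing
    }
    $value = $compute( $url, $part );
    set_transient( $key, $value, 3600 ); // assumed lifetime: one hour
    return $value;
}
```

With this in place, the remote request and the DOM parsing would run at most once per part per hour instead of on every page load.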

The issue is with the DomNode(s). Check out "DOMDocument remove script tags from HTML source", which should give you an idea of how to modify your code.
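
For reference, a minimal sketch of removing the style and script nodes themselves. Note that the XPath expression //*[style or script] matches elements that have a style or script child (their parents), whereas //style | //script matches the elements directly. The sample markup is an assumption, and copying the result into an array before removing nodes is a defensive choice rather than something the linked question requires.

```php
<?php
libxml_use_internal_errors( true );
$doc = new DOMDocument();
// sample markup (assumption) in place of the remote file
$doc->loadHTML( '<html><head><style>p{color:red}</style></head><body><p>Hi</p><script>alert(1)</script></body></html>' );

$xpath = new DOMXPath( $doc );
// copy the matches into a plain array so the loop does not depend on
// a node list while the document is being modified
$nodes = iterator_to_array( $xpath->query( '//style | //script' ) );
foreach ( $nodes as $node ) {
    $node->parentNode->removeChild( $node );
}

$html_only = $doc->saveHTML();
```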
