
How to remove Inner HTML from parsed HTML with DOMDocument?

I have to process user-provided markup for a specific kind of embed, which is typically a <script> tag with a src attribute. A variety of different <script> embeds can be used here, and they all differ. To avoid potential XSS attacks, we've decided to strip out anything inside the tag while leaving the tag itself in place.

<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js">document.write("vinny say something funny"); //This should be sanitized out</script>

DOMDocument doesn't give us an easy way to alter the inner HTML. I have seen a few approaches, but none seem to address keeping the attributes intact if the tag is destroyed and rebuilt. Am I missing a better approach, or is there an easier way to address this?

This code removes child nodes from the <script> node. In this case it's the document element:

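<?php
$xml = '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js">document.write("vinny say something funny");</script>';

$doc = new DOMDocument();
// loadXML() needs well-formed XML; this snippet happens to qualify.
$doc->loadXML($xml);

// The <script> element is the root of this document.
$scriptNode = $doc->documentElement;

// Remove every child node (here, the inline script text); attributes are untouched.
while ($scriptNode->hasChildNodes()) {
    $scriptNode->removeChild($scriptNode->lastChild);
}

echo $doc->saveXML();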

Output is:

<?xml version="1.0"?>
<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js"/>
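
If the cleaned tag is going back into an HTML page, the self-closing form above can be a problem, because HTML parsers do not treat `<script ... />` as a closed element. A minimal sketch of one way around that, assuming the same input as above: pass the node and the `LIBXML_NOEMPTYTAG` option to `saveXML()`. The variable name `$markup` is only illustrative.

```php
<?php
$xml = '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js">document.write("vinny say something funny");</script>';

$doc = new DOMDocument();
$doc->loadXML($xml);

// Same child-removal loop as in the answer above.
$scriptNode = $doc->documentElement;
while ($scriptNode->hasChildNodes()) {
    $scriptNode->removeChild($scriptNode->lastChild);
}

// Passing the node serializes only that element (no XML declaration);
// LIBXML_NOEMPTYTAG writes an explicit closing tag instead of "/>".
$markup = $doc->saveXML($scriptNode, LIBXML_NOEMPTYTAG);
echo $markup, PHP_EOL;
```

This keeps the XML parser, so it still only works when the submitted markup is well-formed XML.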

A simple method is to make a shallow clone of the node, calling cloneNode() without the optional parameter. A shallow clone copies the element and its attributes but none of its child nodes.

This goes through the loaded document and replaces each script node with its empty clone...

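$html = '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js">document.write("vinny say something funny");</script>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// getElementsByTagName() returns a live list, so take a snapshot before changing the tree.
foreach (iterator_to_array($doc->getElementsByTagName("script")) as $script) {
    // cloneNode() with no argument is shallow: attributes are kept, children are not.
    $script->parentNode->replaceChild($script->cloneNode(), $script);
}
echo $doc->saveHTML();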

gives...

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js"></script></head></html>
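
The output above includes a DOCTYPE and `<html>`/`<head>` wrappers that the parser added. As a rough sketch of how the shallow-clone approach might be wrapped up to return only the cleaned tags, the helper below serializes each replacement node on its own with `saveHTML($node)`. The function name `stripScriptBodies` is hypothetical, and the sketch assumes that only the `<script>` elements of the submitted markup need to be kept.

```php
<?php
// Hypothetical helper: returns the <script> elements found in $markup,
// with their attributes kept and their contents removed.
function stripScriptBodies(string $markup): string
{
    if (trim($markup) === '') {
        return '';
    }

    $doc = new DOMDocument();

    // User markup is often not valid HTML; collect parser warnings
    // internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $out = '';
    // Snapshot the live node list before changing the tree.
    foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
        // Shallow clone: same tag and attributes, no child nodes.
        $empty = $script->cloneNode();
        $script->parentNode->replaceChild($empty, $script);
        // Passing a node to saveHTML() serializes just that node,
        // so the DOCTYPE and wrapper elements are left out.
        $out .= $doc->saveHTML($empty);
    }

    return $out;
}

$result = stripScriptBodies('<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js">document.write("vinny say something funny");</script>');
echo $result, PHP_EOL;
```

Note that this only empties the tag. It does not check where the src attribute points or what other attributes are present.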


 