
Stripping all tag attributes from an HTML string except style

I have a form where users can enter descriptions, using TinyMCE for styling. Because of this, my users can insert HTML. I am already stripping almost all HTML elements using strip_tags, but users can still input malicious values, such as this one:

<strong onclick="window.location='http://example.com'">Evil</strong>

I would like to prevent users from being able to do this, by stripping all attributes from all tags, except for the style attribute.

I can only find solutions to strip either all attributes, or strip only the specified ones. I would like to keep only the style attribute.

I have tried DOMDocument, but it seems to add DOCTYPE and html tags on its own, outputting it as an entire HTML document. Additionally, it sometimes seems to randomly add HTML entities such as upside-down question marks.

Here's my DOMDocument implementation:

//Example "evil" input
$description = "<p><strong onclick=\"alert('evil');\">Evil</strong></p>";

//Strip all tags from description except these
$description = strip_tags($description, '<p><br><a><b><i><u><strong><em><span><sup><sub>');

//Strip attributes from tags (to prevent inline Javascript)
$dom = new DOMDocument();
$dom->loadHTML($description);
foreach($dom->getElementsByTagName('*') as $element)
{
    //Attributes cannot be removed while iterating: DOMNamedNodeMap is a live collection, so removing entries mid-loop skips some
    //Attribute names are first saved to an array and then looped over later
    $attributes_to_remove = array();
    foreach($element->attributes as $name => $value)
    {
        if($name != 'style')
        {
            $attributes_to_remove[] = $name;
        }
    }

    //Loop over saved attributes and remove them
    foreach($attributes_to_remove as $attribute)
    {
        $element->removeAttribute($attribute);
    }
}
echo $dom->saveHTML();

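DOMDocument::loadHTML() accepts two libxml options that solve the problem: LIBXML_HTML_NOIMPLIED stops the parser from adding implied html and body elements, and LIBXML_HTML_NODEFDTD stops it from adding a default DOCTYPE.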

$dom->loadHTML($description,  LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
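Put together with the attribute loop from the question, the two options can be sketched as follows. This is a minimal sketch, assuming PHP >= 5.4 with a recent enough libxml; the sample input and the $clean variable are only for illustration.

```php
<?php
// Example input with a single root element
$description = '<p><strong onclick="alert(\'evil\');" style="color:red;">Evil</strong></p>';

$dom = new DOMDocument();
// NOIMPLIED: no implied <html>/<body>; NODEFDTD: no default DOCTYPE
$dom->loadHTML($description, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach ($dom->getElementsByTagName('*') as $element) {
    // Copy the live attribute map into an array before removing anything from it
    foreach (iterator_to_array($element->attributes) as $name => $attribute) {
        if ($name !== 'style') {
            $element->removeAttribute($name);
        }
    }
}

// saveHTML() may append a trailing newline, hence the trim()
$clean = trim($dom->saveHTML());
echo $clean, "\n";
```

One caveat: with LIBXML_HTML_NOIMPLIED, input that has several top-level elements has been reported to come back restructured, because the parser treats the first element as the root. Wrapping the input in a single container element, or using the body-children approach, avoids relying on that behaviour.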

These options are only available with libxml >= 2.7.8 (and PHP >= 5.4). With an older version you can try a different approach:

If you know the input is a fragment, you can load it normally and then save only the children of the body element, which leaves out the DOCTYPE, html, and body wrappers.

$description = <<<'HTML'
<strong onclick="alert('evil');" style="text-align:center;">Evil</strong>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($description);
foreach($dom->getElementsByTagName('*') as $element) {
    $attributes_to_remove = iterator_to_array($element->attributes);
    unset($attributes_to_remove['style']);
    foreach($attributes_to_remove as $attribute => $value) {
        $element->removeAttribute($attribute);
    }
}
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $node) {
    echo $dom->saveHTML($node);
}

Output:

<strong style="text-align:center;">Evil</strong>
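The body-children approach can also be wrapped in a reusable function. The sketch below is illustrative only: the function name strip_attributes_except and its $keep parameter are hypothetical, and it assumes the input is UTF-8. The stray characters mentioned in the question are typically an encoding issue, since loadHTML() treats input as ISO-8859-1 unless told otherwise; prepending a meta charset tag is one common way to hint the encoding.

```php
<?php
// Hypothetical helper: keep only the attributes listed in $keep on every element
function strip_attributes_except(string $html, array $keep = ['style']): string
{
    // Collect parser warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Assumption: $html is UTF-8; the meta tag tells libxml so
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('*') as $element) {
        // Copy the live attribute map before removing entries from it
        foreach (iterator_to_array($element->attributes) as $name => $attribute) {
            if (!in_array($name, $keep, true)) {
                $element->removeAttribute($name);
            }
        }
    }

    // Only the children of <body> belong to the original fragment
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $out = '';
    foreach ($body->childNodes as $node) {
        $out .= $dom->saveHTML($node);
    }
    return $out;
}

$out = strip_attributes_except('<p><strong onclick="alert(1)" style="color:red;">Café</strong></p>');
echo $out, "\n";
```

Because the meta tag ends up in the head element, it never appears in the returned fragment. Depending on the PHP and libxml versions, non-ASCII characters may come back either as raw UTF-8 or as HTML entities.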

I don't know if this is more or less what you mean to do...

$description = "<p><strong onclick=\"alert('evil');\">Evil</strong></p>";
$description = strip_tags( $description, '<p><br><a><b><i><u><strong><em><span><sup><sub>' );

$dom=new DOMDocument;
$dom->loadHTML( $description );
$tags=$dom->getElementsByTagName('*');

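foreach( $tags as $tag ){
    if( $tag->hasAttributes() ){
        /* Copy the live attribute map first; removing entries while iterating it directly can skip some */
        $attributes=iterator_to_array( $tag->attributes );
        foreach( $attributes as $name => $attrib ){
            /* Keep `style`, remove everything else */
            if( $name !== 'style' ) $tag->removeAttribute( $name );
        }
    }
}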
echo $dom->saveHTML();
/* Will echo out `Evil` in bold but without the `onclick` */
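A note on the removal loop: $tag->attributes is a live DOMNamedNodeMap, so removing attributes while iterating over it directly can leave some behind when an element has several. A minimal sketch of the snapshot pattern, with a made-up sample element carrying four attributes:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<p id="a" class="b" onclick="c" title="d">text</p>');
$p = $dom->getElementsByTagName('p')->item(0);

// iterator_to_array() copies the live map into a plain array keyed by attribute name
foreach (iterator_to_array($p->attributes) as $name => $attribute) {
    $p->removeAttribute($name);
}

echo $p->attributes->length, "\n"; // 0: every attribute was removed
```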


 