Stripping all tag attributes from an HTML string except style

Question

I have a form where users can enter descriptions, using TinyMCE for styling. Because of this, my users can insert HTML. I am already stripping almost all HTML elements using strip_tags, but users can still enter malicious values, such as this one:

<strong onclick="window.location='http://example.com'">Evil</strong>

I would like to prevent users from being able to do this, by stripping all attributes from all tags, except for the style attribute.

I can only find solutions to strip either all attributes, or strip only the specified ones. I would like to keep only the style attribute.

I have tried DOMDocument, but it seems to add DOCTYPE and html tags on its own, outputting it as an entire HTML document. Additionally, it sometimes seems to randomly add HTML entities such as upside-down question marks.

Here's my DOMDocument implementation:

//Example "evil" input
$description = "<p><strong onclick=\"alert('evil');\">Evil</strong></p>";

//Strip all tags from description except these
$description = strip_tags($description, '<p><br><a><b><i><u><strong><em><span><sup><sub>');

//Strip attributes from tags (to prevent inline Javascript)
$dom = new DOMDocument();
$dom->loadHTML($description);
foreach($dom->getElementsByTagName('*') as $element)
{
    //Attributes cannot be removed while iterating, because DOMNamedNodeMap is a live collection and removals shift the remaining entries
    //Attribute names are first saved to an array and then looped over later
    $attributes_to_remove = array();
    foreach($element->attributes as $name => $value)
    {
        if($name != 'style')
        {
            $attributes_to_remove[] = $name;
        }
    }

    //Loop over saved attributes and remove them
    foreach($attributes_to_remove as $attribute)
    {
        $element->removeAttribute($attribute);
    }
}
echo $dom->saveHTML();

Answer 1

DOMDocument::loadHTML() accepts two libxml options that solve the problem: LIBXML_HTML_NOIMPLIED stops it from adding the implied html and body elements, and LIBXML_HTML_NODEFDTD stops it from adding a default DOCTYPE.

$dom->loadHTML($description, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

Both are only available with libxml >= 2.7.8. If you have an older version, you can try a different approach:

If you know the input is a fragment, you can let DOMDocument build the full document and save only the children of the body element.

$description = <<<'HTML'
<strong onclick="alert('evil');" style="text-align:center;">Evil</strong>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($description);
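foreach($dom->getElementsByTagName('*') as $element) {
    //Copy the live attribute map into an array so removals do not disturb the loop
    $attributes_to_remove = iterator_to_array($element->attributes);
    //Keep the style attribute
    unset($attributes_to_remove['style']);
    foreach($attributes_to_remove as $attribute => $value) {
        $element->removeAttribute($attribute);
    }
}
//Print only the children of body, skipping the DOCTYPE, html and body wrappers
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $node) {
    echo $dom->saveHTML($node);
}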

Output:

<strong style="text-align:center;">Evil</strong>
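
If your libxml is new enough, the flag-based variant from the top of this answer could be wrapped up roughly as follows. This is only a sketch: the function name strip_attributes_except_style and the temporary wrapper div are assumptions of mine, added because LIBXML_HTML_NOIMPLIED tends to behave best when the fragment has a single root element.

```php
<?php
// Hypothetical helper; the name and the wrapper <div> are assumptions,
// not part of the original answer.
function strip_attributes_except_style(string $html): string
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of emitting them,
    // since user-supplied markup is often malformed.
    $previous = libxml_use_internal_errors(true);

    // Parse as a fragment: no implied <html>/<body>, no default DOCTYPE.
    // The temporary <div> gives the parser one root element.
    $dom->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('*') as $element) {
        // Iterate over a copy, because the attribute map is live.
        foreach (iterator_to_array($element->attributes) as $name => $attribute) {
            if ($name !== 'style') {
                $element->removeAttribute($name);
            }
        }
    }

    // Serialize only the children of the temporary wrapper.
    $out = '';
    foreach ($dom->documentElement->childNodes as $node) {
        $out .= $dom->saveHTML($node);
    }
    return $out;
}

echo strip_attributes_except_style(
    '<strong onclick="alert(\'evil\');" style="text-align:center;">Evil</strong>'
);
```

Returning a string instead of echoing inside the loop makes it easier to store the cleaned description or escape it further before output.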

Answer 2

I don't know if this is more or less what you mean to do...

$description = "<p><strong onclick=\"alert('evil');\">Evil</strong></p>";
$description = strip_tags( $description, '<p><br><a><b><i><u><strong><em><span><sup><sub>' );

$dom=new DOMDocument;
$dom->loadHTML( $description );
$tags=$dom->getElementsByTagName('*');

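foreach( $tags as $tag ){
    if( $tag->hasAttributes() ){
        /* Copy the live attribute map first; removing while iterating it can skip entries */
        $attributes=iterator_to_array( $tag->attributes );
        foreach( $attributes as $name => $attrib ){
            /* Keep `style`, remove everything else */
            if( $name !== 'style' ) $tag->removeAttribute( $name );
        }
    }
}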
echo $dom->saveHTML();
/* Will echo out `Evil` in bold but without the `onclick` */
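
The stray characters mentioned in the question are most likely an encoding problem: loadHTML() treats the input as ISO-8859-1 unless the markup declares a charset, so UTF-8 text gets misread. One possible workaround, sketched here under the assumptions that the input is UTF-8 and that the mbstring extension is available, is to convert non-ASCII characters to numeric entities before parsing. The function name strip_attributes_utf8 is hypothetical.

```php
<?php
// Hypothetical helper; the name and the entity-encoding step are assumptions,
// not part of either answer. Requires the dom and mbstring extensions.
function strip_attributes_utf8(string $html): string
{
    // Turn every non-ASCII character into a numeric entity so the parser
    // never has to guess the input encoding.
    $html = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    foreach ($dom->getElementsByTagName('*') as $element) {
        // Iterate over a copy, because the attribute map is live.
        foreach (iterator_to_array($element->attributes) as $name => $attribute) {
            if ($name !== 'style') {
                $element->removeAttribute($name);
            }
        }
    }

    // Save only the children of body, skipping DOCTYPE, html and body.
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $node) {
        $out .= $dom->saveHTML($node);
    }
    return $out;
}

echo strip_attributes_utf8('<em onclick="x()">café ¿qué?</em>');
```

Depending on the libxml version, the accented characters may come back either as raw UTF-8 or as entities; both render the same in a browser.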

solution1
1 ACCEPTED 2015-10-20 13:18:05

solution2
0 2015-10-20 08:19:02
