parse HTML using DOMDocument in PHP

Question

I want to use DOMDocument to parse a string that comes from a rich-text editor. Specifically, I need to:

1) Allow only the tags div, p, span, b, ul, ol, li, blockquote, and br; remove all other tags along with their content.

Edit: I'm using strip_tags() for this.

2) Allow only these styles:

style="font-weight:bold"
style="font-style: italic"
style="text-decoration: underline"

3) Remove every attribute from the allowed tags (class, id, etc.), except for the align attribute.

Any ideas?

Answer 1

I would recommend against trying to filter HTML input with DOMDocument yourself for security reasons, in particular the risk of cross-site scripting. You can easily take care of requirements 1 and 3 with a filter library like HTML Purifier. For the reasons Spudley mentions, number 2 is a little more difficult. I'd start by whitelisting those style properties in HTML Purifier and then use some logic to scan for them after filtering, adding the appropriate tags inside that element.

Here's an example of using HTML Purifier the way you want (adapted from basic.php). The only things I've changed are the HTML.AllowedAttributes, HTML.AllowedElements, and CSS.AllowedProperties settings.

<?php
// replace this with the path to the HTML Purifier library
require_once 'library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();

// configuration goes here:
$config->set('Core.Encoding', 'UTF-8'); // replace with your encoding
$config->set('HTML.Doctype', 'XHTML 1.0 Transitional'); // replace with your doctype
$config->set('HTML.AllowedAttributes', '*.style, align');
$config->set('HTML.AllowedElements', 'div, p, span, b, ul, ol, li, blockquote, br');
$config->set('CSS.AllowedProperties', 'font-weight, font-style, text-decoration');


$purifier = new HTMLPurifier($config);

$html = '<div align="center" style="font-style:italic; color: red" title="removeme">Allowed</div> <img src="not_allowed.jpg" /> <script>not allowed</script>';

$filteredHtml = $purifier->purify($html);
echo '<pre>' . htmlspecialchars($filteredHtml) . '</pre>';

Which outputs:

<div align="center" style="font-style:italic;">Allowed</div>
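
The "scan for them after filtering" step could look roughly like the sketch below. It is plain PHP, not part of HTML Purifier: the function names styleToTags and wrapWithTags, and the mapping from declarations to the b, i and u tags, are assumptions made for illustration (i and u are not in the question's allowed tag list, so adjust to taste).

```php
<?php
// Hypothetical helper: returns the tags that correspond to the allowed
// declarations found in a style attribute value.
function styleToTags(string $style): array
{
    // Assumed mapping, chosen for illustration only.
    $map = [
        'font-weight:bold'          => 'b',
        'font-style:italic'         => 'i',
        'text-decoration:underline' => 'u',
    ];
    $tags = [];
    foreach (explode(';', $style) as $declaration) {
        // Normalise e.g. " Font-Weight : bold" to "font-weight:bold".
        $key = preg_replace('/\s+/', '', strtolower($declaration));
        if (isset($map[$key])) {
            $tags[] = $map[$key];
        }
    }
    return array_values(array_unique($tags));
}

// Hypothetical helper: wraps already-filtered inner HTML in the given tags,
// with the first tag outermost.
function wrapWithTags(string $innerHtml, array $tags): string
{
    foreach (array_reverse($tags) as $tag) {
        $innerHtml = "<$tag>" . $innerHtml . "</$tag>";
    }
    return $innerHtml;
}

$tags = styleToTags('font-weight: bold; color: red; FONT-STYLE:italic');
echo wrapWithTags('Allowed', $tags), "\n"; // prints: <b><i>Allowed</i></b>
```

Only pass content that has already been purified to wrapWithTags, since it does no escaping of its own.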

Answer 2

Since you only want to allow a small number of HTML elements, you could consider cleaning up the HTML code with the PHP strip_tags() function prior to giving it to the DOMDocument classes.

This will certainly be easier than parsing the DOM yourself to find elements that need to be stripped.

That should deal with part 1 of your question.

It won't deal with parts 2 or 3, but it's a good start.
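
As a minimal sketch of that call, using the allow-list from the question (the sample input is made up): note that strip_tags() removes disallowed tags but keeps the text between them, and it leaves attributes on allowed tags untouched, so it covers part 1 only partially.

```php
<?php
// Allowed tags from the question, in the format strip_tags() expects.
$allowed = '<div><p><span><b><ul><ol><li><blockquote><br>';

// Made-up sample input for illustration.
$input = '<p class="x">Hi <i>there</i></p><script>alert(1)</script>';

$output = strip_tags($input, $allowed);

// The <i> and <script> tags are gone, but their text content remains,
// and the class attribute on <p> is still there.
echo $output, "\n"; // prints: <p class="x">Hi there</p>alert(1)
```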

Answer 3

I have code that does exactly this, but it is largely undocumented and includes some code I did not write (it is in the public domain). It is pretty easy to use. The fix_html function ensures all tags are closed so they cannot affect the rest of your page. strip_tags_attributes limits which tags and attributes are allowed, and strip_javascript removes JavaScript functionality of any sort. I have used this extensively, but to be honest I do not know whether this copy is the one from production. As for your second point, I think it is best to remove styles altogether so users can use <i> or <b> as they like. And please don't let anyone use underline.

function findNodeValue($parent, $node) {
    $nodes=array();
    if(!is_a($parent, "DOMElement")) return NULL;

    foreach($parent->childNodes as $child)
        if($child->nodeName==$node) $nodes[]=$child;

    if(count($nodes)==0) return NULL;
    if(count($nodes)==1) return $nodes[0]->nodeValue;
    else {
        $ret=array();
        foreach($nodes as $node)
            $ret[]=$node->nodeValue;

        return $ret;
    }
}

function strip_javascript($filter){ 

    // realign javascript href to onclick 
    $filter = preg_replace("/href=(['\"]).*?javascript:(.*)?\\1/i", "onclick=' $2 '", $filter);

    //remove javascript from tags 
    while( preg_match("/<(.*)?javascript.*?\(.*?((?>[^()]+)|(?R)).*?\)?\)(.*)?>/i", $filter)) 
        $filter = preg_replace("/<(.*)?javascript.*?\(.*?((?>[^()]+)|(?R)).*?\)?\)(.*)?>/i", "<$1$3$4$5>", $filter); 

    // dump expressions from contibuted content 
    $filter = preg_replace("/:expression\(.*?((?>[^(.*?)]+)|(?R)).*?\)\)/i", "", $filter); 
    $filter = preg_replace("/<iframe.*?>/", "", $filter);
    $filter = preg_replace("/<\/iframe>/", "", $filter);

    while( preg_match("/<(.*)?:expr.*?\(.*?((?>[^()]+)|(?R)).*?\)?\)(.*)?>/i", $filter)) 
        $filter = preg_replace("/<(.*)?:expr.*?\(.*?((?>[^()]+)|(?R)).*?\)?\)(.*)?>/i", "<$1$3$4$5>", $filter); 

    // remove all on* events    
    while( preg_match("/<(.*)?\s?on[^>\s]+?=\s?.+?(['\"]).*?\\2\s?(.*)?>/i", $filter, $match) ) {
        $filter = preg_replace("/<(.*)?\s?on[^>\s]+?=\s?.+?(['\"]).*?\\2\s?(.*)?>/i", "<$1$3>", $filter); 
    }

    return $filter; 
}

function html2a ( $html ) {
  ini_set('pcre.backtrack_limit', 10000);
  ini_set('pcre.recursion_limit', 10000);

  if ( !preg_match_all( '@\<\s*?(\w+)((?:\b(?:\'[^\']*\'|"[^"]*"|[^\>])*)?)\>((?:(?>[^\<]*)|(?R))*)\<\/\s*?\\1(?:\b[^\>]*)?\>|\<\s*(\w+)(\b(?:\'[^\']*\'|"[^"]*"|[^\>])*)?\/?\>@uxis', $html = trim($html), $m, PREG_OFFSET_CAPTURE | PREG_SET_ORDER) )
    return $html;
  $i = 0;
  $ret = array();
  foreach ($m as $set) {
    if ( strlen( $val = trim( substr($html, $i, $set[0][1] - $i) ) ) )
      $ret[] = $val;
    $val = $set[1][1] < 0 
      ? array( 'tag' => strtolower($set[4][0]) )
      : array( 'tag' => strtolower($set[1][0]), 'val' => html2a($set[3][0]) );
    if ( preg_match_all( '/(\w+)\s*(?:=\s*(?:"([^"]*)"|\'([^\']*)\'|(\w+)))?/usix', isset($set[5]) && $set[2][1] < 0 ? $set[5][0] : $set[2][0],$attrs, PREG_SET_ORDER ) ) {
      foreach ($attrs as $a) {
        $val['attr'][$a[1]]=$a[count($a)-1];
      }
    }
    $ret[] = $val;
    $i = $set[0][1]+strlen( $set[0][0] );
  }
  $l = strlen($html);
  if ( $i < $l )
    if ( strlen( $val = trim( substr( $html, $i, $l - $i ) ) ) )
      $ret[] = $val;
  return $ret;
}

function a2html ( $a, $in = "" ) {
  if ( is_array($a) ) {
    $s = "";
    foreach ($a as $t)
      if ( is_array($t) ) {
        $attrs=""; 
        if ( isset($t['attr']) )
          foreach( $t['attr'] as $k => $v )
            $attrs.=" {$k}=".( strpos( $v, '"' )!==false ? "'$v'" : "\"$v\"" ); // {$k}: the ${k} form is deprecated since PHP 8.2
        $s.= $in."<".$t['tag'].$attrs.( isset( $t['val'] ) ? ">\n".a2html( $t['val'], $in).$in."</".$t['tag'] : "/" ).">";
      } else
        $s.= $in.$t."";
  } else {
    $s = empty($a) ? "" : $in.$a."";
  }
  return $s;
}

function remove_unclosed(&$a, $allowunclosed) {
    if(!is_array($a)) return;

    foreach($a as $k=>$tag) {
        if(is_array($tag)) {
            if(!isset($tag["val"]) && !in_array($tag["tag"],$allowunclosed)) {
                //var_dump($tag["tag"]);
                unset($a[$k]);
            } elseif(is_array(@$tag["val"]))
                remove_unclosed($a[$k]["val"], $allowunclosed);
        }
    }
}

function fix_html($html, $allowunclosed=array("br")) {
    $a = html2a($html);
    remove_unclosed($a, $allowunclosed);
    return a2html($a);
}

function strip_tags_ex($str,$allowtags) { 
    $strs=explode('<',$str); 
    $res=$strs[0]; 
    for($i=1;$i<count($strs);$i++) 
    { 
        if(!strpos($strs[$i],'>')) 
            $res = $res.'&lt;'.$strs[$i]; 
        else 
            $res = $res.'<'.$strs[$i]; 
    } 
    return strip_tags($res,$allowtags);    
}

// Note: allowedtags and allowedattributes are constants that are not defined in this snippet;
// define them before calling, or pass both arguments explicitly.
function strip_tags_attributes($string,$allowtags=allowedtags,$allowattributes=allowedattributes){
    $string=strip_javascript($string);

    $string = strip_tags_ex($string,$allowtags);

    if (!is_null($allowattributes)) {
        if(!is_array($allowattributes))
            $allowattributes = explode(",",$allowattributes);
        if(is_array($allowattributes))
            $allowattributes = implode(")(?<!",$allowattributes);
        if (strlen($allowattributes) > 0)
            $allowattributes = "(?<!".$allowattributes.")";
        // A closure replaces create_function(), which was removed in PHP 8.
        $pattern = "/ [^ =]*".$allowattributes."=(\"[^\"]*\"|'[^']*')/i";
        $string = preg_replace_callback("/<[^>]*>/i", function ($matches) use ($pattern) {
            return preg_replace($pattern, "", $matches[0]);
        }, $string);
    }
    return $string;
}

I found the source for strip_javascript: http://www.php.net/manual/en/function.strip-tags.php#89453. I don't know why the attribution was not already in the code, probably because the comment has no name, email, or other identity to refer to.

Answer 4

$allowedTags = array( 'div' => true, 'p' => true, 'span' => true, 'b' => true,
    'ul' => true, 'ol' => true, 'li' => true, 'blockquote' => true, 'br' => true );

$allowedStyles = array( 'font-weight: bold' => true, 'font-style: italic' => true, 'text-decoration: underline' => true );

$allowedAttribs = array( 'align' => true );

$doc = new DOMDocument();
$doc->loadXML( '<doc><p style="font-weight: bold">test</p> <b align="left">asdfasd faksd</b><script>asdfasd</script></doc>' );

sanitizeNodeChildren( $doc->documentElement );

echo htmlentities( $doc->saveXml() );

function sanitizeNodeChildren( $parentNode ) {
    $node = $parentNode->firstChild;
    while( $node ) {
        if( !sanitizeNode( $node ) ) {
            $nodeToDelete = $node;
            $node = $node->nextSibling;
            $parentNode->removeChild( $nodeToDelete );
        } else {
            sanitizeNodeChildren( $node );
            $node = $node->nextSibling;
        }
    }
}

function sanitizeNode( $node ) {
    global $allowedTags, $allowedStyles, $allowedAttribs;
    if( $node->nodeType == XML_ELEMENT_NODE ) {
        if( !isset( $allowedTags[ $node->tagName ] ) ) return false;

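        // Iterate over a copy: removing attributes from the live list while looping can skip entries.
        foreach( iterator_to_array( $node->attributes ) as $name => $attrNode ) {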
            if( $name == 'style' ) {
                if( isset( $allowedStyles[ $attrNode->nodeValue ] ) ) continue;
            }
            if( isset( $allowedAttribs[ $name ] ) ) continue;
            $node->removeAttribute( $name );
        }
    }

    return true;
}
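
Rich-text editor output is often not well-formed XML, so loadXML() may reject it. The variant below is a sketch of the same idea using loadHTML() on an HTML fragment. The function names (cleanStyle, cleanChildren, sanitizeHtml) are hypothetical, the allow-lists are copied from the question, and style declarations are compared after stripping whitespace and lowercasing. Treat it as an illustration under those assumptions rather than a vetted XSS filter.

```php
<?php
// Allow-lists taken from the question; all function names are hypothetical.
const ALLOWED_TAGS   = ['div', 'p', 'span', 'b', 'ul', 'ol', 'li', 'blockquote', 'br'];
const ALLOWED_STYLES = ['font-weight:bold', 'font-style:italic', 'text-decoration:underline'];

// Keep only the whitelisted declarations of a style attribute value.
function cleanStyle(string $style): string
{
    $kept = [];
    foreach (explode(';', $style) as $declaration) {
        $normalized = preg_replace('/\s+/', '', strtolower($declaration));
        if (in_array($normalized, ALLOWED_STYLES, true)) {
            $kept[] = $normalized;
        }
    }
    return implode(';', $kept);
}

function cleanChildren(DOMNode $parent): void
{
    // Snapshot the live child list so removals do not disturb the loop.
    foreach (iterator_to_array($parent->childNodes) as $child) {
        if ($child instanceof DOMText) {
            continue; // plain text is kept
        }
        if (!($child instanceof DOMElement)
            || !in_array(strtolower($child->nodeName), ALLOWED_TAGS, true)) {
            $parent->removeChild($child); // drops the node together with its content
            continue;
        }
        foreach (iterator_to_array($child->attributes) as $attr) {
            $name = $attr->nodeName;
            if ($name === 'align') {
                continue;
            }
            $style = $name === 'style' ? cleanStyle($attr->value) : '';
            if ($style !== '') {
                $child->setAttribute('style', $style);
            } else {
                $child->removeAttribute($name);
            }
        }
        cleanChildren($child);
    }
}

function sanitizeHtml(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect warnings about sloppy markup
    // Prefixing an XML encoding declaration is a common trick to make libxml read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    cleanChildren($body);

    // Serialize only the body's children, not the html/body wrapper libxml adds.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo sanitizeHtml('<div align="center" style="font-style: italic; color: red" class="x">Allowed</div><script>alert(1)</script>'), "\n";
```

Unlike strip_tags(), this removes a disallowed element together with its content, which is what part 1 of the question asks for.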

4 answers

solution1
1 ACCEPTED 2011-07-30 20:23:36

solution2
0 2011-07-30 20:15:35

solution3
0 2011-07-30 20:34:20

solution4
0 2011-07-30 21:34:22
