In DOMDocument, is reusing DOMXpath stable?

Question

I am using the function below, but I am not sure whether it is always stable and safe. Is it?

When, and in which cases, is it stable/safe to "reuse parts of the DOMXpath preparation procedure"?

To simplify the use of the XPath query() method, we can adopt a function that memorizes the last call in static variables:

   function DOMXpath_reuser($file) {
      static $doc=NULL;
      static $docName='';
      static $xp=NULL;
      if (!$doc)
                $doc = new DOMDocument();
      if ($file!=$docName) {
                $doc->loadHTMLFile($file);
                $docName = $file;  // remember the loaded file so it is not reloaded on every call
                $xp = NULL;
      }
      if (!$xp) 
                $xp = new DOMXpath($doc);
      return $xp;  // ??RETURNED VALUES ARE ALWAYS STABLE??
   }

The present question is similar to this other one about XSLTProcessor reuse. In both questions, the problem can be generalized to any language or framework that uses LibXML2 as its DOMDocument implementation.

There is another related question: How to "refresh" DOMDocument instances of LibXML2?

Illustrating

Reuse is very common (examples):

   $f = "my_XML_file.xml";
   $elements = DOMXpath_reuser($f)->query("//*[@id]");
   // use elements to get information
   $elements = DOMXpath_reuser($f)->query("/html/body/div[1]");
   // use elements to get information

But if you do something like removeChild, replaceChild, etc. (example),

   $div = DOMXpath_reuser($f)->query("/html/body/div[1]")->item(0);  //STABLE
   $div->parentNode->removeChild($div);                // CHANGES DOM
   $elements = DOMXpath_reuser($f)->query("//div[@id]"); // UNSTABLE!

strange things can occur, and the queries do not work as expected!

When? (Which DOMDocument methods affect XPath?)
Why can't we use something like normalizeDocument to "refresh the DOM" (does such a thing exist)?
Is only a "new DOMXpath($doc);" always safe? Does $doc need to be reloaded as well?

Answer 1

DOMXpath is affected by the load*() methods on DOMDocument. After loading new XML or HTML, you need to recreate the DOMXpath instance:

$xml = '<xml/>';    
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);

var_dump($xpath->document === $dom); // bool(true)

$dom->loadXml($xml);

var_dump($xpath->document === $dom); // bool(false)

In DOMXpath_reuser() you store a static variable and recreate the XPath object depending on the file name. If you want to reuse an XPath object, I suggest extending DOMDocument. This way you only need to pass the $dom variable around. It would work with a stored XML file as well as with an XML string or a document you are creating.

The following class extends DOMDocument with a method xpath() that always returns a valid DOMXpath instance for it. It stores and registers the namespaces, too:

class MyDOMDocument
  extends DOMDocument {

  private $_xpath = NULL;
  private $_namespaces = array();

  public function xpath() {
    // if the xpath instance is missing or not attached to the document
    if (is_null($this->_xpath) || $this->_xpath->document != $this) {
      // create a new one
      $this->_xpath = new DOMXpath($this);
      // and register the namespaces for it
      foreach ($this->_namespaces as $prefix => $namespace) {
        $this->_xpath->registerNamespace($prefix, $namespace);
      }
    }
    return $this->_xpath;
  }

  public function registerNamespaces(array $namespaces) {
    $this->_namespaces = array_merge($this->_namespaces, $namespaces);
    if (isset($this->_xpath)) {
      foreach ($namespaces as $prefix => $namespace) {
        $this->_xpath->registerNamespace($prefix, $namespace);
      }
    }
  }
}

$xml = <<<'ATOM'
  <feed xmlns="http://www.w3.org/2005/Atom">
    <title>Test</title>
  </feed>
ATOM;


$dom = new MyDOMDocument();
$dom->registerNamespaces(
  array(
    'atom' => 'http://www.w3.org/2005/Atom'
  )
);
$dom->loadXml($xml);
// created, first access
var_dump($dom->xpath()->evaluate('string(/atom:feed/atom:title)', NULL, FALSE));
$dom->loadXml($xml);
// recreated, connection was lost
var_dump($dom->xpath()->evaluate('string(/atom:feed/atom:title)', NULL, FALSE));

Answer 2

The DOMXpath class (unlike XSLTProcessor in your other question) keeps a reference to the DOMDocument object given to its constructor. DOMXpath creates a libxml context object based on the given DOMDocument and saves it in its internal class data. Besides the libxml context, it also saves a reference to the original DOMDocument given in the constructor arguments.

What that means:

Part of the sample from ThomasWeinert's answer:

var_dump($xpath->document === $dom); // bool(true)  
$dom->loadXml($xml);    
var_dump($xpath->document === $dom); // bool(false)

yields false after the load because $dom already holds a pointer to the new libxml data, while DOMXpath still holds the libxml context that was created for $dom before the load. Its document property therefore no longer refers to the document that $dom wraps after the load.
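To see this divergence on your own PHP build, a minimal sketch like the following may help. It compares an XPath object created before a reload with one created after it. The variable names ($staleXpath, $freshXpath) and the tiny sample documents are made up for illustration. If the description above holds, the stale object keeps answering from the old tree, but the script prints the result rather than assuming it.

```php
<?php
// Sketch: compare a DOMXPath created before a reload with one created after it.
$dom = new DOMDocument();
$dom->loadXML('<root><old/></root>');
$staleXpath = new DOMXPath($dom);      // created against the first tree

$dom->loadXML('<root><new/></root>');  // $dom now wraps a newly parsed tree
$freshXpath = new DOMXPath($dom);      // created against the second tree

// The fresh object is evaluated against the reloaded content.
$freshCount = $freshXpath->query('//new')->length;

// What the stale object reports depends on which tree it still points to,
// so print it instead of assuming a value.
$staleResult = $staleXpath->query('//old');

printf("fresh //new: %d\n", $freshCount);
printf(
    "stale //old: %s\n",
    $staleResult === false ? 'query failed' : (string) $staleResult->length
);
```

Whatever the stale object prints, only the object created after the load is guaranteed to be bound to the content that $dom currently holds.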

Now, about how query works

If it should return XPATH_NODESET (as in your case), it makes a copy of the nodes, iterating node by node through the detected node set (\ext\dom\xpath.c, from line 468). It is a copy, but with the original document node as parent. This means that you can modify the result, but doing so breaks the connection between your XPath and the DOMDocument.

XPath results provide a parentNode member that knows their origin:

for attribute values, parentNode returns the element that carries them. An example is //foo/@attribute, where the parent would be a foo Element.
for the text() function (as in //text()), it returns the element that contains the text or tail that was returned.
note that parentNode may not always return an element. For example, the XPath functions string() and concat() will construct strings that do not have an origin. For them, parentNode will return None.
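The three cases above can be checked in PHP with a short sketch. Two points here are assumptions of this example, not part of the answer: the sample document is invented, and the attribute case uses DOMAttr's ownerElement property, since that is the documented way in PHP to reach the element carrying an attribute. In PHP, string() evaluated through evaluate() comes back as a plain string, so there is no node to ask for a parent at all.

```php
<?php
// Sketch: where do XPath results "come from"?
$dom = new DOMDocument();
$dom->loadXML('<root><foo attribute="v">hello</foo></root>');
$xpath = new DOMXPath($dom);

// Attribute result: ownerElement is the element that carries the attribute.
$attr = $xpath->query('//foo/@attribute')->item(0);
$attrOwner = $attr->ownerElement->nodeName;

// Text result: parentNode is the element that contains the text.
$text = $xpath->query('//foo/text()')->item(0);
$textParent = $text->parentNode->nodeName;

// string() result: evaluate() returns a plain PHP string, not a node,
// so it carries no origin information.
$str = $xpath->evaluate('string(//foo)');

printf("attribute owner: %s\n", $attrOwner);
printf("text parent: %s\n", $textParent);
printf("string value: %s (%s)\n", $str, gettype($str));
```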

So,

There is no reason to cache XPath. It does nothing besides xmlXPathNewContext (it just allocates a lightweight internal struct).
Each time you modify your DOMDocument (removeChild, replaceChild, etc.), you should recreate the XPath.
We cannot use something like normalizeDocument to "refresh the DOM", because it changes the internal document structure and invalidates the xmlXPathNewContext context created in the Xpath constructor.
Is only "new DOMXpath($doc);" always safe? Yes, if you do not change $doc between XPath uses. Do you also need to reload $doc? No, because that invalidates the previously created xmlXPathNewContext context.
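If caching buys nothing, the simplest pattern is to build a new XPath object for every query. The following is only a sketch of that idea: xpathQuery() is a hypothetical helper name, and the sample document is invented. It throws on an invalid expression so that a false return value from query() does not go unnoticed.

```php
<?php
// Hypothetical helper: build a new DOMXPath for every query instead of caching one.
function xpathQuery(DOMDocument $doc, string $expression): DOMNodeList
{
    $result = (new DOMXPath($doc))->query($expression);
    if ($result === false) {
        throw new InvalidArgumentException("Invalid XPath expression: $expression");
    }
    return $result;
}

$doc = new DOMDocument();
$doc->loadXML('<body><div id="a"/><div id="b"/></body>');

// Two div elements carry an id before the change.
$before = xpathQuery($doc, '//div[@id]')->length;

// Modify the DOM between queries.
$div = xpathQuery($doc, '/body/div[1]')->item(0);
$div->parentNode->removeChild($div);

// A freshly built XPath object is evaluated against the modified tree.
$after = xpathQuery($doc, '//div[@id]')->length;

printf("before: %d, after: %d\n", $before, $after);
```

The cost is one small allocation per query, which matches the point above that the constructor does little more than create a context.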

Answer 3

(This is not a real answer, but a consolidation of the comments and answers posted here and in related questions.)

This new version of the question's DOMXpath_reuser function incorporates @ThomasWeinert's suggestion (to guard against DOM changes caused by an external reload) and an option $enforceRefresh to work around the instability problem (as the related question shows, the programmer must detect when to use it).

   function DOMXpath_reuser_v2($file, $enforceRefresh=0) {  //changed here
      static $doc=NULL;
      static $docName='';
      static $xp=NULL;
      if (!$doc)
                $doc = new DOMDocument();
      if ( $file!=$docName || ($xp && $doc !== $xp->document) ) { // changed here
                $doc->load($file);
                $docName = $file;  // remember the loaded file so it is not reloaded on every call
                $xp = NULL;
      } elseif ($enforceRefresh==2) {  // add this new refresh mode
                $doc->loadXML($doc->saveXML());
                $xp = NULL;
      }
      if (!$xp || $enforceRefresh==1)  //changed here
                $xp = new DOMXpath($doc);
      return $xp;
   }

When must $enforceRefresh=1 be used?

... perhaps an open problem; only a few tips and clues:

when the DOM has been subjected to setAttribute, removeChild, replaceChild, etc.
...? more cases?

When must $enforceRefresh=2 be used?

... perhaps an open problem; only a few tips and clues:

when the DOM has been subject to index inconsistencies, etc. See this question/solution.
...? more cases?
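To make the second refresh mode concrete, here is a minimal sketch of the saveXML()/loadXML() round trip in isolation. refreshDocument() is a hypothetical helper name, and the example assumes a document that serializes cleanly as XML (a document loaded with loadHTMLFile would be re-parsed as XML by this trick). Note that node objects fetched before the round trip do not belong to the re-parsed tree.

```php
<?php
// Hypothetical helper mirroring $enforceRefresh=2: re-serialize the document,
// parse the result again, and return an XPath object for the new tree.
function refreshDocument(DOMDocument $doc): DOMXPath
{
    $doc->loadXML($doc->saveXML());
    return new DOMXPath($doc);
}

$doc = new DOMDocument();
$doc->loadXML('<body><div id="a"/><div id="b"/></body>');
$xp = new DOMXPath($doc);

// Change the DOM through a node obtained before the refresh.
$oldDiv = $xp->query('//div[@id="a"]')->item(0);
$oldDiv->setAttribute('id', 'changed');

// Round-trip the document and query the re-parsed tree.
$xp = refreshDocument($doc);
$newDiv = $xp->query('//div[@id="changed"]')->item(0);

// The change survived serialization, but $oldDiv and $newDiv are different nodes.
var_dump($newDiv instanceof DOMElement);
var_dump($newDiv->isSameNode($oldDiv));
```

Because every previously fetched node becomes detached from the current tree, this mode is the most expensive one and only makes sense as a last resort.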

3 answers

solution1
3 2013-11-21 15:47:26

solution2
2 ACCEPTED 2013-11-22 11:08:11

solution3
1 2013-11-21 17:53:07
