How can I remove an additional </html> tag from a string using a regular expression?

Question

I am using PHP's DOMDocument to replace a node and then rewrite the page. The HTML that is written back comes out as plain text with escaped angle brackets (not as HTML), so I had to convert it like so:

$content = files::readFile($data['page_path']);
$content = str_replace('&lt;', '<', $content);
$content = str_replace('&gt;', '>', $content);

if (!@fwrite($handle, $content))
{
    print 'Failed to replace entities';
    return FALSE;
}

This makes the HTML proper; however, for some odd reason, it adds an extra </html> tag to the bottom of the document, with some additional data after the offending </html> tag. I am at a total loss as to why.

Anyway, I thought about using:

$content = preg_replace('#\<\/head\>*(:alphanum:)#', '</html>', $content);

to remove it but this doesn't match the way I thought it would.

Help please!

Testing example:

$html = '
   <div id="footer">
       <div class="wrap">
           <strong class="logo"><a href="#">College</a></strong>
           <ul><li><a href="#">Emergencies</a></li>
               <li><a href="#">Contact</a></li>
               <li><a href="#">Copyright</a></li>
               <li><a href="#">Terms of Use</a></li>
               <li><a href="#">Member of The Colleges</a></li>
           </ul><p>© 2010 College</p>
       </div>
   </div>
</body></html>
li>
               <li><a href="#">Contact</a></li>
               <li><a href="#">Copyright</a></li>
               <li><a href="#">Terms of Use</a></li>
               <li><a href="#">Member of The Colleges</a></li>
           </ul><p>© 2010 College</p>
       </div>
   </div>
</body></html>';

preg_match("#</head>.*#si", $html, $matches);
var_dump($matches);

Answer 1

Your code:

$content = preg_replace('#\<\/head\>*(:alphanum:)#', '</html>', $content);

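The reason this doesn't work is that :alphanum: is not valid character-class syntax in PREG-style regexes; written like this, it is treated as the literal text ":alphanum:". Named classes of this kind come from the POSIX syntax used by PHP's ereg-style regex functions (where the class is spelled [[:alnum:]]), but those functions have been deprecated, so you should stick with PREG.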

The solution is to replace :alphanum: with the correct PREG syntax.

In the case of :alphanum:, the correct syntax is [0-9a-zA-Z].

There are also shortcuts such as \d for any numeric character, which could be used instead of 0-9. See http://www.regular-expressions.info/reference.html for more info on these.
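A minimal sketch comparing these spellings. The sample string and variable names are made up for illustration and are not from the question:

```php
<?php
// Hypothetical sample input, not taken from the question.
$sample = 'abc123';

// Explicit ranges, as suggested above.
$explicit = preg_match('/^[0-9a-zA-Z]+$/', $sample);

// \d is shorthand for a digit, so [\da-zA-Z] describes the same set.
$shorthand = preg_match('/^[\da-zA-Z]+$/', $sample);

// A bare :alphanum: is not a class; PCRE looks for that literal text.
$literal = preg_match('/:alphanum:/', $sample);

var_dump($explicit, $shorthand, $literal);
```

The first two calls should report a match, while the third should not, because the sample does not contain the literal text ":alphanum:".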

[EDIT]

So now you have:

$content = preg_replace('#\<\/head\>*([0-9a-zA-Z])#', '</html>', $content);

As you say, that still doesn't work, but at least the obvious problem is dealt with.

The first question: You've got "</head>" in the regex, but you seem to be asking to replace everything after "</html>" -- is the "</head>" a typo?

Second point: The regex [0-9a-zA-Z] will only match one single character. To match multiple characters, you need to add either a + or a * after the [0-9a-zA-Z]. The plus sign indicates that you want to match one or more of [0-9a-zA-Z], whereas the asterisk indicates zero or more occurrences.
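To make the difference between no quantifier, + and * concrete, here is a small sketch. The input strings and variable names are hypothetical:

```php
<?php
$class = '[0-9a-zA-Z]';

// No quantifier: the class matches exactly one character.
preg_match('/' . $class . '/', 'abc123', $one);

// Plus: one or more characters in a row.
preg_match('/' . $class . '+/', 'abc123', $plus);

// Asterisk allows zero occurrences, so it accepts an empty string;
// plus needs at least one character, so it rejects it.
$starOnEmpty = preg_match('/^' . $class . '*$/', '');
$plusOnEmpty = preg_match('/^' . $class . '+$/', '');

var_dump($one[0], $plus[0], $starOnEmpty, $plusOnEmpty);
```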

Third point: The code you've quoted in the question includes more than just alphanumeric characters. It is HTML, so it includes angle brackets, #-symbols, even a copyright symbol. Obviously [0-9a-zA-Z] won't match any of them (nor would :alphanum: using ereg syntax).

You could add all the characters that could appear in the code to your regex, to make something like [0-9a-zA-Z<>#], but that's probably not practical. A better idea would be to match any character, rather than specifically looking for alphanumerics. To match any character, you would use a dot (.). By default the dot does not match a newline, so add the s modifier to let the match run across line breaks; your regex would then look more like this:

$content = preg_replace('#</html>.+#s', '</html>', $content);
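To see how the dot behaves across line breaks, here is a minimal sketch run against a shortened, made-up stand-in for the damaged page (the markup in $content is an assumption, not the asker's real file):

```php
<?php
// Shortened stand-in for the damaged page; the content is made up.
$content = "<body><p>Footer</p>\n</body></html>\nli>\n<li>Contact</li>\n</body></html>";

// Without the s modifier the dot stops at a newline. The first </html> is
// followed directly by a newline and the second by nothing, so .+ never matches
// and the string comes back unchanged.
$withoutS = preg_replace('#</html>.+#', '</html>', $content);

// With the s modifier the dot also matches newlines, so everything after the
// first </html> is consumed and replaced.
$withS = preg_replace('#</html>.+#s', '</html>', $content);

var_dump($withoutS === $content);
var_dump($withS);
```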

Hope that helps.

Answer 2

I'm not sure exactly how the alphanumeric characters fit in with the problem, but I suspect you want the * quantifier after the alphanumeric character class in order to match any number of them:

$content = preg_replace('#</head>[\da-z]*#i', '</html>', $content);

In fact, I wonder if you might even want to match everything after the </head> tag, in which case this might work for you:

$content = preg_replace('#</head>.*#si', '</html>', $content);

Edit: Now that you've added example text, I see you have 2 </html> tags. How about:

$content = preg_replace('#</html>.*#si', '</html>', $content);
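A quick check of the last two patterns, as a sketch only. The $html string is a shortened, hypothetical version of the sample: like the question's test string, it has two closing tags and no </head>, and one tag is upper-cased to exercise the i modifier:

```php
<?php
// Shortened, made-up stand-in for the question's test string.
$html = "<div id=\"footer\"></div>\n</body></html>\nli>\n</div>\n</body></HTML>";

// The </head> pattern finds nothing because the fragment contains no </head>.
$headMatches = preg_match('#</head>.*#si', $html);

// The </html> pattern matches from the first closing tag to the end of the string.
$fixed = preg_replace('#</html>.*#si', '</html>', $html);

// Count the closing tags that remain, ignoring case.
$remaining = substr_count(strtolower($fixed), '</html>');

var_dump($headMatches, $remaining, $fixed);
```

This may also explain why the preg_match() test in the question returns nothing: that sample is a footer fragment without a </head> tag.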

Answer 3

The problem has been solved: I figured out the strange bug I had been experiencing in the reusable content. The issue was in how I opened the file before calling PHP's fwrite(): I was passing mode 'r+' to fopen().

The documentation at php.net/fopen describes 'r+' as follows: "Open for reading and writing; place the file pointer at the beginning of the file." I naively assumed that, since the pointer was at the beginning, writing would replace the entire file contents. It does not: the new data overwrites the file from the start, and if it is shorter than the old contents, the leftover tail stays in place. That tail was the extra </html> tag and the data after it.

If you want the whole file replaced, use mode 'w', which the documentation describes as: "Open for writing only; place the file pointer at the beginning of the file and truncate the file to zero length. If the file does not exist, attempt to create it."
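The behaviour can be reproduced with a throwaway file. This is a minimal sketch; the temporary path and the sample markup are assumptions chosen for the demonstration, not the original page:

```php
<?php
// Temporary file used only for this demonstration.
$path = tempnam(sys_get_temp_dir(), 'demo');

// Start with a longer document.
file_put_contents($path, "<html><body>old, longer content</body></html>\n");

// 'r+' keeps the existing bytes; writing shorter content leaves a stale tail.
$handle = fopen($path, 'r+');
fwrite($handle, "<html></html>\n");
fclose($handle);
$afterRPlus = file_get_contents($path);

// 'w' truncates the file to zero length first, so only the new content remains.
$handle = fopen($path, 'w');
fwrite($handle, "<html></html>\n");
fclose($handle);
$afterW = file_get_contents($path);

unlink($path);

var_dump($afterRPlus);
var_dump($afterW);
```

After the 'r+' write, the file should still end with the tail of the old document, including a second </html>, which matches the symptom described in the question. If the handle has to stay in 'r+' mode, calling ftruncate($handle, 0) before writing is one possible alternative, though it was not part of the original fix.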

3 answers

solution1 0 2011-03-01 18:53:47
solution2 0 2011-03-01 20:36:47
solution3 0 ACCEPTED
