How can I escape HTML character entities when using ColdFusion function XMLFormat()?

Question

I have the following block of HTML:

<p>The quick brown fox jumps over the lazy dog &mdash; The quick brown fox jumps over the lazy dog.</p>
<p>The quick brown fox jumps over the lazy dog &mdash; The quick brown fox jumps over the lazy dog.
<br>The quick brown fox jumps over the lazy dog &mdash; The quick brown fox jumps over the lazy dog.

It is NOT valid XHTML. However, I need to include this HTML in an XML document. I tried using XMLFormat() to convert the < to &lt; and the > to &gt;, which works great. Unfortunately, it also converts &mdash; to &amp;mdash;, which is not valid and throws an exception in the CFXML tag.

<cfxml variable="myXML">
    <content>#XMLFormat(myHTML)#</content>
</cfxml>

How can I work around this?

Answer 1

You have a few options. A lot depends on how this content is going to be used. It would be extremely helpful to include a desired output document, as well as indicate where this xml is being used.

If you don't want to mess with the content of the HTML at all, you could always use CDATA, like this:

<cfxml variable="myXML">
    <content><![CDATA[#myHTML#]]></content>
</cfxml>

Also, I know you say you don't want to convert the remaining ampersands but I just don't see how this is so. Either the HTML content is a string you want to process -- in which case, all of it should be escaped so that it can be unescaped later -- or it's valid XML that you want to be part of the document. I mean, when you process the contents of the <content> tag later on, you will run into problems if the ampersands aren't escaped.

Answer 2

Unfortunately this answer:

<cfxml variable="myXML">
    <content><![CDATA[#myHTML#]]></content>
</cfxml>

is insufficient if you happen to have invalid HTML that you want to display. Consider the case where myHTML contains:

<p>some invalid html ]]><script>alert('foo')</script>

As far as I can tell, there is no supported way in ColdFusion to properly encode potentially invalid data. Your best bet is to write yourself a filter function that entity-encodes HTML special and illegal characters.
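One narrow piece of such a filter can be sketched as follows. The problem with the sample input is the "]]>" sequence, which closes the CDATA section early. A common workaround is to split that sequence across two adjacent CDATA sections. The function name wrapInCDATA is made up for this illustration and is not a built-in, and the sketch does nothing about characters that are illegal in XML (such as most control characters).

```cfml
<cfscript>
// Hypothetical helper (not a ColdFusion built-in): wraps a string in a
// CDATA section and splits every "]]>" so that it cannot terminate the
// section early. "]]>" becomes "]]" + end of section + new section + ">".
function wrapInCDATA(required string raw) {
    var safe = replace(arguments.raw, "]]>", "]]]]><![CDATA[>", "all");
    return "<![CDATA[" & safe & "]]>";
}
</cfscript>

<cfxml variable="myXML">
    <content><cfoutput>#wrapInCDATA(myHTML)#</cfoutput></content>
</cfxml>
```

A consumer reading the text of the content element should see the two sections joined back together, so the original "]]>" is preserved as data rather than treated as markup.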

Answer 3

It's tough when you have some HTML partially converted, and then need to do the rest...

You could replace all the "&" signs temporarily, run the XMLFormat, then convert the "&" signs back.

<cfscript>
// replace & signs with a temp placeholder
myHTML = replace(myHTML, "&", "*amp*", "all");

// format for XML
myHTML = XMLFormat(myHTML);

// replace placeholders with & signs
myHTML = replace(myHTML, "*amp*", "&", "all");
</cfscript>

If it works, you could make this one step by wrapping this logic in a single function.
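A minimal sketch of that single function, assuming the same three steps as above. The function name and the "*amp*" placeholder are illustrative choices; the placeholder must not occur in the real content, or it will be turned into an ampersand by mistake.

```cfml
<cfscript>
// Hypothetical helper: runs XMLFormat() on everything except ampersands.
function xmlFormatKeepAmpersands(required string html) {
    // Assumed placeholder; pick something that cannot appear in the input
    var placeholder = "*amp*";

    // 1. hide the ampersands from XMLFormat()
    var result = replace(arguments.html, "&", placeholder, "all");

    // 2. escape the remaining special characters
    result = XMLFormat(result);

    // 3. put the ampersands back
    return replace(result, placeholder, "&", "all");
}

myHTML = xmlFormatKeepAmpersands(myHTML);
</cfscript>
```

Keep in mind that the restored references, such as &mdash;, are only well-formed XML if the entity is declared, since XML itself predefines only lt, gt, amp, apos and quot.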

Answer 4

How about simply not using the &mdash; escape in the source string and instead including the literal em dash character (—) in situ.
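If the source string cannot be changed where it is stored, the same idea could be applied just before building the XML. This is only a sketch: it handles the single entity from the question, any other named HTML entity would need its own replacement, and the output encoding of the page has to be able to represent the character.

```cfml
<cfscript>
// Swap the named HTML entity for the character it stands for (U+2014).
myHTML = replace(myHTML, "&mdash;", chr(8212), "all");

// Escape whatever is left, including any remaining ampersands.
myHTML = XMLFormat(myHTML);
</cfscript>
```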

Edit :

I'm gonna guess that the HTML content stored in the database is not known to be XHTML compliant and hence to put it in an XML document you have no choice but to either place it in a CDATA section or encode it correctly. There is an assumption that placing it in an XML document like this is useful and that it can be properly decoded at the consuming end. This will be true of either approach if a typical XML DOM is used at the consumer.

So this leads me to this question: what's actually wrong with &amp;mdash;? After all, &lt; will result in < etc. When retrieved from a DOM by the consumer, the resulting string will be back to using &mdash; and < and so on, so when it is subsequently used as HTML all will be well.

Answer 5

HTMLEditFormat(string) should convert your less-than and greater-than signs, but will also handle the ampersand. I understand that you want to leave the &mdash; as-is. It is worth pointing out that &mdash; is not one of XML's predefined entities (although you can define it).

I just thought I'd mention it, as HTMLEditFormat() is ideal for escaping HTML to include in XML documents, such as Atom feeds. It sounds like it is not the solution for your specific use case, however.
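For reference, a rough sketch of that usage, reusing the myHTML and myXML names from the question. Because the ampersand is escaped as well, &mdash; is stored as &amp;mdash; and should come back as &mdash; when the element text is read.

```cfml
<cfxml variable="myXML">
    <content><cfoutput>#HTMLEditFormat(myHTML)#</cfoutput></content>
</cfxml>

<!--- Reading the text back un-escapes it again --->
<cfoutput>#myXML.content.XmlText#</cfoutput>
```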

Answer 6

For now, I am simply replacing all the less-than and greater-than characters with "&lt;" and "&gt;" respectively.

Answer 7

In this specific use case, you can use URLEncodedFormat() to preserve the natural form of the content, and then use URLDecode() on the way out.

<cfxml variable="content">
    <content><cfoutput>#URLEncodedFormat(myHTML)#</cfoutput></content>
</cfxml>
<cfset xml = xmlParse(content)>
<cfoutput>#URLDecode(xml.content.xmltext)#</cfoutput>

I'm not recommending this as a best practice, only that it would work in the scenario posed by the question.

7 answers

solution1
8 2010-02-02 22:16:38

solution2
3 2011-05-04 23:15:46

solution3
1 2010-02-02 22:10:10

solution4
1 2010-02-02 22:13:03

solution5
0 2012-05-31 14:44:10

solution6
0 2010-02-04 15:43:34

solution7
0 2010-02-07 20:57:06
