How can I escape HTML character entities when using ColdFusion function XMLFormat()?

I have the following block of HTML:

<p>The quick brown fox jumps over the lazy dog &mdash; The quick brown fox jumps over the lazy dog.</p>
<p>The quick brown fox jumps over the lazy dog &mdash; The quick brown fox jumps over the lazy dog.
<br>The quick brown fox jumps over the lazy dog &mdash; The quick brown fox jumps over the lazy dog.

It is NOT valid XHTML. However, I need to include this HTML in an XML document. I tried using XMLFormat() to convert the < to &lt; and the > to &gt;, which works great. Unfortunately, it also converts &mdash; to &amp;mdash;, which is not valid and throws an exception in the CFXML tag.

<cfxml variable="myXML">
    <content>#XMLFormat(myHTML)#</content>
</cfxml>

How can I work around this?

You have a few options. A lot depends on how this content is going to be used. It would be extremely helpful to include a desired output document and to indicate where this XML is being used.

If you don't want to mess with the content of the HTML at all, you could always use CDATA, like this:

<cfxml variable="myXML">
    <content><![CDATA[#myHTML#]]></content>
</cfxml>

Also, I know you say you don't want to convert the remaining ampersands, but I don't see why. Either the HTML content is a string you want to process, in which case all of it should be escaped so that it can be unescaped later, or it's valid XML that you want to be part of the document. When you process the contents of the <content> tag later on, you will run into problems if the ampersands aren't escaped.

Unfortunately this answer:

<cfxml variable="myXML">
    <content><![CDATA[#myHTML#]]></content>
</cfxml>

is insufficient if you happen to have invalid HTML that you want to display. Consider the case where myHTML contains:

<p>some invalid html ]]><script>alert('foo')</script>

As far as I can tell, there is no supported way in ColdFusion to properly encode potentially invalid data. Your best bet is to write yourself a filter function that entity-encodes HTML special and illegal characters.

It's tough when you have some HTML partially converted and then need to do the rest...

You could replace all the "&" signs temporarily, run XMLFormat(), then convert the "&" signs back.
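If you would rather keep the CDATA approach than write a full entity-encoding filter, one common workaround is to split every "]]>" in the content across two CDATA sections. This is a different technique from the filter suggested above, shown only as a minimal sketch: cdataSafe is a hypothetical helper name, not a ColdFusion built-in, and it assumes myHTML contains no characters that are illegal in XML (such as most control characters).

```cfml
<cfscript>
// Hypothetical helper: cdataSafe is an illustrative name, not part of ColdFusion.
function cdataSafe(required string text) {
    // Close the section after "]]" and reopen it before ">", so the
    // terminator "]]>" never appears inside a single CDATA section.
    var body = replace(arguments.text, "]]>", "]]]]><![CDATA[>", "all");
    return "<![CDATA[" & body & "]]>";
}
</cfscript>

<cfxml variable="myXML">
    <content><cfoutput>#cdataSafe(myHTML)#</cfoutput></content>
</cfxml>
```

An XML parser concatenates adjacent CDATA sections in the element's text, so the consumer should get the original string back, including the "]]>" sequence.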

<cfscript>
// replace & signs with a temp placeholder
myHTML = replace(myHTML, "&", "*amp*", "all");

// format for XML
myHTML = XMLFormat(myHTML);

// replace placeholders with & signs
myHTML = replace(myHTML, "*amp*", "&", "all");
</cfscript>

If it works, you could make this a single step by wrapping the logic in a function.

How about simply not using the &mdash; escape in the source string and instead including the literal em dash character (—, U+2014) in situ?

Edit:
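A minimal sketch of such a wrapper, assuming the placeholder string never occurs in the real content. The function name xmlFormatKeepAmp and the placeholder value are hypothetical choices for illustration.

```cfml
<cfscript>
// Hypothetical wrapper: xmlFormatKeepAmp is an illustrative name, not a built-in.
function xmlFormatKeepAmp(required string html) {
    // Assumption: this placeholder does not appear anywhere in the input.
    var placeholder = "*amp*";

    // Hide the ampersands, escape the rest, then restore the ampersands.
    var result = replace(arguments.html, "&", placeholder, "all");
    result = XMLFormat(result);
    return replace(result, placeholder, "&", "all");
}

myHTML = xmlFormatKeepAmp(myHTML);
</cfscript>
```

Note that the restored ampersands are left unescaped, so any entity in the content, such as &mdash;, still has to be one that the XML parser recognizes.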

I'm going to guess that the HTML content stored in the database is not known to be XHTML compliant, so to put it in an XML document you have no choice but to either place it in a CDATA section or encode it correctly. The assumption is that placing it in an XML document like this is useful and that it can be properly decoded at the consuming end. That holds for either approach if a typical XML DOM is used by the consumer.

So this leads me to this question: what's actually wrong with &amp;mdash;? After all, < will result in &lt; etc. When retrieved from a DOM by the consumer, the resulting string will be back to using &mdash; and < and so on, and when subsequently used as HTML all will be well.

HTMLEditFormat(string) should convert your less-than and greater-than signs, but it will also escape the ampersand. I understand that you want to leave the &mdash; as-is. It is worth pointing out that &mdash; is not one of XML's predefined entities (although you can define it).

I just thought I'd mention it, as HTMLEditFormat() is ideal for escaping HTML to include in XML documents, such as Atom feeds. It sounds like it is not the solution for your specific use case, however.
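The round trip described here can be sketched as follows, assuming the XML is assembled as a string and parsed with xmlParse(). The variable names and the sample HTML are illustrative only.

```cfml
<cfscript>
// Sample input: invalid XHTML containing an HTML named entity.
myHTML = "<p>The quick brown fox &mdash; the lazy dog.<br>More text";

// Escape everything, including the ampersand in &mdash;.
xmlString = "<content>" & XMLFormat(myHTML) & "</content>";

// Parsing decodes &lt;, &gt; and &amp; back to <, > and &.
doc = xmlParse(xmlString);
roundTripped = doc.content.xmlText;

// Compare the decoded text with the original HTML string.
writeOutput(compare(roundTripped, myHTML) EQ 0);
</cfscript>
```

The point of the sketch is that the consumer reads the element text through the DOM rather than looking at the raw markup, so the doubly escaped &amp;mdash; is never visible to the HTML that is eventually rendered.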
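Because &mdash; is not predefined in XML, another option is to swap it for a numeric character reference, which XML parsers accept without any entity definition. The sketch below is an assumption about what might suit the question, not something proposed in the thread. It handles only &mdash;, so any other named entities would need their own replacements.

```cfml
<cfscript>
// Escape everything first, so &mdash; becomes &amp;mdash;.
escaped = XMLFormat(myHTML);

// Turn the doubly escaped entity into a numeric reference (U+2014).
// "##" is CFML's way of writing a literal pound sign inside a string.
escaped = replace(escaped, "&amp;mdash;", "&##8212;", "all");
</cfscript>

<cfxml variable="myXML">
    <content><cfoutput>#escaped#</cfoutput></content>
</cfxml>
```

After parsing, the element text contains the em dash character itself rather than the entity name, which is fine when the text is later output as HTML in a Unicode encoding.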

For now, I am simply replacing all less-than and greater-than characters with "&lt;" and "&gt;" respectively.

In this specific use case, you can use URLEncodedFormat() to preserve the natural form of the content, and then use URLDecode() on the way out.

<cfxml variable="content">
    <content><cfoutput>#URLEncodedFormat(myHTML)#</cfoutput></content>
</cfxml>
<cfset xml = xmlParse(content)>
<cfoutput>#URLDecode(xml.content.xmltext)#</cfoutput>

I'm not recommending this as a best practice, only noting that it would work in the scenario posed by the question.
