
HTML inside XML. Should I use CDATA or encode the HTML?

I am using XML to share HTML content. As far as I know, I could embed the HTML in either of two ways:

  • Encoding it: I don't know if it is completely safe to use, and I would have to decode it again.

  • Using CDATA sections: I believe I could still have problems if the content contains the closing marker "]]>" or certain hexadecimal characters. On the other hand, the XML parser would extract the content transparently for me.

Which option should I choose?

UPDATE: The XML will be created in Java and passed as a string to a .NET web service, where it will be parsed back. Therefore I need to be able to export the XML as a string and load it using "doc.LoadXml(xmlString);".

The two options are almost exactly the same. Here are your two choices:

<html>This is &lt;b&gt;bold&lt;/b&gt;</html>

<html><![CDATA[This is <b>bold</b>]]></html>

In both cases, you have to check your string for special characters that need escaping. Lots of people pretend that CDATA strings don't need any escaping, but as you point out, you have to make sure that "]]>" doesn't slip in unescaped.

In both cases, the XML processor will hand your string back to you already decoded.
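To make that concrete, here is a minimal sketch in Python (my choice of language for illustration, not something the question uses) that parses both forms with the standard library's ElementTree and compares the results. The variable names are only for this example.

```python
import xml.etree.ElementTree as ET

# The two candidate encodings of the same HTML fragment.
escaped = "<html>This is &lt;b&gt;bold&lt;/b&gt;</html>"
cdata = "<html><![CDATA[This is <b>bold</b>]]></html>"

# The parser undoes either form of quoting, so the element text is identical.
text_from_escaped = ET.fromstring(escaped).text
text_from_cdata = ET.fromstring(cdata).text

print(text_from_escaped)                      # This is <b>bold</b>
print(text_from_escaped == text_from_cdata)   # True
```

Either way the consumer sees a plain string containing markup; nothing in the parsed result tells you which form the producer chose.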

CDATA is easier to read by eye, while encoded content can safely contain end-of-CDATA markers, but you don't have to care. Just use an XML library and stop worrying about it. Then all you have to say is "Put this text inside this element" and the library will either encode it or wrap it in CDATA markers.
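As a rough illustration of "put this text inside this element", here is a sketch using Python's standard ElementTree. The Java and .NET XML APIs from the question work on the same principle, but the code below is only an assumed stand-in, and the sample string is made up.

```python
import xml.etree.ElementTree as ET

# Raw HTML with every awkward character: &, <, > and even a "]]>" marker.
html_fragment = "Tom & Jerry: <b>bold</b> and a stray ]]> marker"

# Assign the raw string as the element text; the library decides how to
# quote it when the document is serialized.
root = ET.Element("html")
root.text = html_fragment
xml_string = ET.tostring(root, encoding="unicode")

# Parsing the serialized string gives the original text back unchanged.
round_tripped = ET.fromstring(xml_string).text

print(xml_string)
print(round_tripped == html_fragment)
```

ElementTree happens to pick entity encoding rather than CDATA here; the point is that the caller never escapes anything by hand.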

Use CDATA, for simplicity.

If you use CDATA, then you must decode it correctly (textContent, value and innerHTML are properties that will NOT return the proper data).

Let us say that you use an XML structure similar to this:

<response>
    <command method="setcontent">
        <fieldname>flagOK</fieldname>
        <content>479</content>
    </command>
    <command method="setcontent">
        <fieldname>htmlOutput</fieldname>
        <content>
            <![CDATA[
            <tr><td>2013/12/05 02:00 - 2013/12/07 01:59 </td></tr><tr><td width="90">Rastreado</td><td width="60">Placa</td><td width="100">Data hora</td><td width="60" align="right">Km/h</td><td width="40">Direção</td><td width="40">Azimute</td><td>Mapa</td></tr><tr><td>Silverado</td><td align='left'>CQK0052</td><td>05/12/2013 13:55</td><td align='right'>113</td><td align='right'>NE</td><td align='right'>40</td><td><a href="http://maps.google.com/maps?q=-22.6766,-50.2218&amp;iwloc=A&amp;t=h&amp;z=18" target="_blank">-22.6766,-50.2218</a></td></tr><tr><td>Silverado</td><td align='left'>CQK0052</td><td>05/12/2013 13:56</td><td align='right'>112</td><td align='right'>NE</td><td align='right'>23</td><td><a href="http://maps.google.com/maps?q=-22.6638,-50.2106&amp;iwloc=A&amp;t=h&amp;z=18" target="_blank">-22.6638,-50.2106</a></td></tr><tr><td>Silverado</td><td align='left'>CQK0052</td><td>05/12/2013 18:00</td><td align='right'>111</td><td align='right'>SE</td><td align='right'>118</td><td><a href="http://maps.google.com/maps?q=-22.7242,-50.2352&amp;iwloc=A&amp;t=h&amp;z=18" target="_blank">-22.7242,-50.2352</a></td></tr>
            ]]>
        </content>
    </command>
</response>

In JavaScript, you would then decode it by loading the XML (with jQuery, for example) into a variable such as xmlDoc below and reading the nodeValue of the first child of the second occurrence ( item(1) ) of the content tag:

xmlDoc.getElementsByTagName("content").item(1).childNodes[0].nodeValue

or (both notations are equivalent):

xmlDoc.getElementsByTagName("content")[1].childNodes[0].nodeValue
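One thing to watch for: when the document is indented as in the sample above, there is whitespace between <content> and <![CDATA[, so the first child node may be a whitespace text node rather than the CDATA section. The sketch below uses Python's xml.dom.minidom, which exposes the same DOM calls as the JavaScript snippets; the shortened payload and the cdata_text helper are hypothetical, added only for this example.

```python
from xml.dom import minidom

# Shortened version of the response above, keeping the same indentation.
xml_string = """<response>
    <command method="setcontent">
        <fieldname>flagOK</fieldname>
        <content>479</content>
    </command>
    <command method="setcontent">
        <fieldname>htmlOutput</fieldname>
        <content>
            <![CDATA[<tr><td>Silverado</td></tr>]]>
        </content>
    </command>
</response>"""

doc = minidom.parseString(xml_string)
content = doc.getElementsByTagName("content")[1]

# With this layout, childNodes[0] is the indentation, not the CDATA section.
first_child = content.childNodes[0]

def cdata_text(element):
    """Concatenate the data of every CDATA section directly under element."""
    return "".join(
        node.data
        for node in element.childNodes
        if node.nodeType == node.CDATA_SECTION_NODE
    )

html = cdata_text(content)
print(repr(first_child.data))
print(html)
```

Selecting by node type instead of by position keeps the lookup working whether or not the producer pretty-prints the XML.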

I don't know what XML builder you're using, but PHP (actually libxml) knows how to handle ]]> inside CDATA sections, and so should every other XML framework. So, I'd use a CDATA section.
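For anyone building the string by hand, the usual trick (and, as far as I know, roughly what such libraries do) is to split every "]]>" across two adjacent CDATA sections. A minimal Python sketch, where wrap_in_cdata is a made-up helper name and the payload is invented:

```python
import xml.etree.ElementTree as ET

def wrap_in_cdata(text):
    """Wrap text in CDATA, splitting any "]]>" across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# Script-like content that happens to contain the "]]>" sequence.
payload = "if (a[b[0]]>1) { alert('<b>hi</b>'); }"
xml_string = "<content>" + wrap_in_cdata(payload) + "</content>"

# The parser joins the adjacent sections back into one string.
decoded = ET.fromstring(xml_string).text

print(xml_string)
print(decoded == payload)
```

The first section ends right after "]]" and the second one starts with ">", so the forbidden three-character sequence never appears inside a single section.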

It makes sense to wrap HTML in CDATA. The HTML text will probably constitute a single value in the XML.

Not wrapping it in CDATA will cause all XML parsers to read it as part of the XML document. While it is easy to work around this problem when consuming the XML, why the extra headache?

If you want to actually parse the HTML into a DOM, then it's better to read the HTML text and set up a parser to read that text separately.

Hope that came out the way I intended it to. 希望以我预期的方式出现。

Personally, I hate CDATA segments, so I'd use encoding instead. Of course, if you add XML to XML to XML, this results in encoding over encoding over encoding and thus some very unreadable output. Why do I hate CDATA segments? I wish I knew. Personal preference, mostly. I just don't like getting used to adding "forbidden characters" inside a special segment where they are suddenly allowed again. It confuses me when I see XML markup within a CDATA segment that is not part of the XML surrounding it. At least with encoding I can see that it's encoded.

Good XML libraries will handle both encoding and CDATA segments transparently. It's just my eyes that get hurt.

Encoding it will work fine and is reliable. You can encode already-encoded sections, and so on, without any difficulty.

Decoding will be done automatically by whatever XML parser is used to handle your encoded HTML.
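A small sketch of the nested case, in Python with the standard library's escape helper. The element names outer and inner and the sample fragment are invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# An HTML fragment that already contains an entity of its own.
html_fragment = "<p>Fish &amp; chips</p>"

# Encode once for the inner document, and again for the outer document.
once = escape(html_fragment)
twice = escape(once)

# Each parse strips exactly one layer of encoding.
outer = "<outer>" + twice + "</outer>"
after_first_parse = ET.fromstring(outer).text

inner = "<inner>" + after_first_parse + "</inner>"
after_second_parse = ET.fromstring(inner).text

print(twice)
print(after_second_parse == html_fragment)
```

The doubly encoded string is ugly to read, but nothing is lost: as long as the number of parses matches the number of encodings, the original fragment comes back intact.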

I think the answer depends on what you are planning to do with the HTML content, and also on what type of HTML content you plan to support.

Especially when it comes to included JavaScript, encoding often results in problems. CDATA definitely helps you there.

If you plan to use only small snippets (e.g. a paragraph) and have a way to preprocess or filter them (because you don't want JavaScript or fancy things anyway), you will probably be better off with encoding, or with putting the HTML directly into the XML as a subtree. You can then also post-process the HTML (e.g. filter out style or onclick attributes). But this is definitely more work.
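The post-processing step could look something like the sketch below, assuming the snippet is well-formed and embedded as a real subtree. The function name and the rule for what counts as unwanted are my own assumptions, and this is nowhere near a complete HTML sanitizer.

```python
import xml.etree.ElementTree as ET

# HTML embedded directly as child elements of <content>.
xml_string = (
    '<content><p style="color:red" onclick="steal()">'
    'Hello <b class="x">world</b></p></content>'
)
root = ET.fromstring(xml_string)

def strip_unwanted_attributes(element):
    """Remove style and on* event-handler attributes from the whole subtree."""
    for node in element.iter():
        for name in list(node.attrib):
            if name == "style" or name.lower().startswith("on"):
                del node.attrib[name]

strip_unwanted_attributes(root)
cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)
```

This kind of filtering is only possible because the markup is part of the tree; with escaped text or CDATA you would first have to parse the string again.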

You can use a combination of both. For example, if you want to pass <h1>....</h1> in an XML node, you can use a CDATA section to pass it. The contents inside <h1>...</h1> must then be encoded as HTML entities, e.g. &lt; for <. Encoding the text between the tags solves the problem of ]]> being interpreted as the end of the section, because it gets converted to ]]&gt;, and the HTML tags themselves do not contain ]]>.

You can do this only if the HTML is generated by yourself.

If your HTML is well-formed, then just embed the HTML tags without escaping them or wrapping them in CDATA. If at all possible, it helps to keep your content as XML. It gives you more flexibility for transforming and manipulating the document.

You could set a namespace for the HTML, so that you can disambiguate your HTML tags from the XML wrapping them.
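A minimal sketch of the namespace idea, again in Python for illustration. The message, title and body wrapper elements and the h prefix are invented; the namespace URI is the standard XHTML one.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# The wrapper elements are in no namespace; the embedded HTML uses the
# XHTML namespace through the "h" prefix.
xml_string = (
    '<message xmlns:h="http://www.w3.org/1999/xhtml">'
    "<title>plain XML title</title>"
    "<body><h:p>This is <h:b>bold</h:b></h:p></body>"
    "</message>"
)
root = ET.fromstring(xml_string)

# HTML elements are looked up by their namespace-qualified name.
bold = root.find(".//{%s}b" % XHTML)

# The wrapper's own <title> is not mistaken for an HTML <title>.
wrapper_title = root.find("title")
html_titles = root.findall(".//{%s}title" % XHTML)

print(bold.text)
print(wrapper_title.text, len(html_titles))
```

Because the HTML stays real markup, the consumer can query into it directly instead of re-parsing a string.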

Escaped text means that the entire HTML block will be one big text node. Wrapping it in CDATA tells the XML parser not to parse that section. It may be "easier", but it limits what you can do downstream and should only be employed when appropriate, not just because it is more convenient. Escaped markup is considered harmful.


 