Why does LINQ to XML not escape characters like '\x1A'?

I get an exception if I include characters such as '\x1A', '\x1B', '\x1C', '\x1D', '\x1E' or '\x1F' in an XElement's content.

using System;
using System.Collections.Generic;
using System.Xml.Linq;

namespace LINQtoXMLInvalidChars
{
    class Program
    {
        private static readonly IReadOnlyCollection<char> InvalidCharactersInXml = new List<char>
        {
            '<',
            '>',
            '&',
            '\'',
            '\"',
            '\x1A',
            '\x1B',
            '\x1C',
            '\x1D',
            '\x1E',
            '\x1F'
        };

        static void Main()
        {
            foreach (var c in InvalidCharactersInXml)
            {
                var xEl = new XElement("tag", "Character: " + c);
                var xDoc = new XDocument(new XDeclaration("1.0", "utf-8", null), xEl);

                try
                {
                    Console.Write("Writing " + c + ": ");
                    Console.WriteLine(xDoc);
                }
                catch (Exception e)
                {
                    Console.WriteLine("Oops.    " + e.Message);
                }
            }

            Console.ReadKey();
        }
    }
}

In an answer from Jon Skeet to the question "String escape into XML" I read:

You set the text in a node, and it will automatically escape anything it needs to.

So now I'm confused. Do I misunderstand something?

Some background information: the string content of the XElement comes from the end user. I see two options for making my application robust: 1) Base64-encode the string before passing it to XElement, or 2) narrow the accepted set of characters to, for example, alphanumeric characters.

Most of those characters simply aren't valid in XML 1.0 at all. Personally I wish that LINQ to XML would fail to produce a document that later it wouldn't be able to parse, but basically you should avoid them.

I would also recommend avoiding \x as an escape sequence anyway, preferring \u - the fact that \x will take "up to" 4 hex digits can be very confusing.
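To illustrate the \x pitfall: a C# \x escape consumes between one and four hex digits, so letters that happen to be hex digits get absorbed into the escape. A minimal sketch follows (the class name EscapeDemo and its members are made up for illustration):

```csharp
using System;

public static class EscapeDemo
{
    // \x greedily takes up to four hex digits: 1, A, b and c are all hex,
    // so this literal is the single character U+1ABC.
    public const string ViaX = "\x1Abc";

    // \u always takes exactly four hex digits, so this is U+001A
    // followed by the two letters 'b' and 'c'.
    public const string ViaU = "\u001Abc";

    public static void Main()
    {
        Console.WriteLine(ViaX.Length); // prints 1
        Console.WriteLine(ViaU.Length); // prints 3
    }
}
```

With \u the length of the escape never depends on what follows it, which is why it is the safer habit.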

From the XML 1.0 spec:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Now U+000D and U+000A are interesting cases - they won't be escaped in text nodes; they'll just be included verbatim. Whether or not they are then present when you parse the node will depend on parse settings (and whether there are non-whitespace characters around them).
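The Char production quoted above can be checked in code. The sketch below leans on XmlConvert.IsXmlChar from System.Xml (available from .NET Framework 4.0 onward); the wrapper class XmlCharCheck is hypothetical. It also shows why the list in the question mixes two different things: '<' and '&' are valid characters that merely need escaping, whereas '\u001A' is not a legal XML 1.0 character in any form.

```csharp
using System;
using System.Xml;

public static class XmlCharCheck
{
    // True when the UTF-16 code unit is allowed by the XML 1.0 Char production.
    // Surrogate halves return false here; surrogate pairs need separate handling.
    public static bool IsAllowed(char c)
    {
        return XmlConvert.IsXmlChar(c);
    }

    public static void Main()
    {
        char[] samples = { '<', '&', '\t', '\n', '\r', '\u001A', '\u001F' };
        foreach (char c in samples)
        {
            // Print the code point rather than the raw character,
            // since several of these are control characters.
            Console.WriteLine("U+{0:X4}: {1}", (int)c, IsAllowed(c));
        }
    }
}
```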

In terms of how to handle this in your case: you definitely have options of:

  • Performing your own encoding/escaping. This is generally somewhat painful, and will lead to XML documents which are hard to read compared with regular ones. You could potentially do this only when required, adding an attribute to the element to say that you've done it, for example.
  • Detecting and removing characters which are invalid in XML
  • Detecting and rejecting strings containing characters which are invalid in XML
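The "remove" and "reject" options above might look roughly like this. This is a sketch, not a definitive implementation: XmlTextSanitizer and its method names are invented for illustration, and the input is assumed to be non-null. It relies on XmlConvert.VerifyXmlChars and XmlConvert.IsXmlChar from System.Xml.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlTextSanitizer
{
    // "Reject" option: true when every character may appear in an XML 1.0 document.
    public static bool IsValidXmlText(string text)
    {
        try
        {
            // Throws XmlException on the first invalid character.
            XmlConvert.VerifyXmlChars(text);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    // "Remove" option: silently drops every character that is invalid in XML 1.0.
    public static string StripInvalidXmlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                // A well-formed pair encodes U+10000..U+10FFFF, which is valid.
                sb.Append(text[i]).Append(text[i + 1]);
                i++;
            }
            else if (XmlConvert.IsXmlChar(text[i]))
            {
                sb.Append(text[i]);
            }
            // Anything else (control characters, lone surrogates) is dropped.
        }
        return sb.ToString();
    }
}
```

For example, XmlTextSanitizer.StripInvalidXmlChars("a\u001Ab") returns "ab", while IsValidXmlText("a<b&c") is true because '<' and '&' are valid characters that LINQ to XML escapes on output.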

We can't really tell which of these is most appropriate in your scenario.
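If the encoding route is chosen, one possible shape for "encode only when required and mark it with an attribute" is sketched below. Everything here is an assumption made for illustration: the EncodedElement class, the attribute name "encoding" and its value "base64" are not any standard. Note also that UTF-8 encoding replaces lone surrogates, so malformed UTF-16 input will not round-trip exactly.

```csharp
using System;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class EncodedElement
{
    // Hypothetical marker attribute; the name and value are assumptions.
    private const string MarkerName = "encoding";
    private const string MarkerValue = "base64";

    // Stores the text as-is when it is valid XML 1.0 text,
    // otherwise stores it Base64-encoded and flags the element.
    public static XElement Create(string name, string text)
    {
        try
        {
            XmlConvert.VerifyXmlChars(text);
            return new XElement(name, text);
        }
        catch (XmlException)
        {
            string encoded = Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
            return new XElement(name, new XAttribute(MarkerName, MarkerValue), encoded);
        }
    }

    // Reverses Create: decodes only when the marker attribute is present.
    public static string Read(XElement element)
    {
        XAttribute marker = element.Attribute(MarkerName);
        if (marker != null && marker.Value == MarkerValue)
        {
            return Encoding.UTF8.GetString(Convert.FromBase64String(element.Value));
        }
        return element.Value;
    }
}
```

Ordinary strings stay human-readable in the document, and only the problematic ones pay the readability cost of Base64.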
