How to beautify incomplete XML documents

I am looking for a way to beautify incomplete XML documents. Ideally, it should handle even large documents (e.g. 10 MB or maybe 100 MB).

Incomplete means that the documents are truncated at a random position. Up to this position the XML has valid syntax. Beautify means adding line breaks and leading spaces between the tags.

In my case this is needed to analyse aborted streams. Without line breaks and indentation, the XML is really hard for a human to read. I know there are some editors which can beautify incomplete documents, but I want to integrate the beautifier into my own analysis tool.

Unfortunately, I didn't find a discussion or solution for that case.

The NuGet package GuiLabs.Language.Xml by Kirill Osenkov (repository XmlParser) seems to be a useful candidate for implementing my own beautifier, because it is designed to be error tolerant. Unfortunately, there is too little documentation to understand how to use this parser.

Example XML:

<?xml encoding="UTF-8"?><X><B><C>aa</C><B/><A.B><X>bb</X></A.B><A p="pp"/><nn:A>cc</nn:A><D><E>eee</

Expected result as string:

<?xml encoding="UTF-8"?>
<X>
    <B>
        <C>aa</C>
    <B/>
    <A.B>
        <X>bb</X>
    </A.B>
    <A p="pp"/>
    <nn:A>cc</nn:A>
    <D>
        <E>eee</

Does it have to be C#?

In Java, you should be able to pipe the output of a SAX parser into an indenting serializer by connecting a SAXSource to a StreamResult using an identity transformer. Then just make sure that, when the SAX parser aborts, you trap the exception and close the output stream tidily.
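
The Java approach described above could look roughly like the following minimal sketch. It uses only the JDK's built-in JAXP classes. The class name TruncatedXmlIndenter and the method indent are made up for illustration, and the indent-amount property is an implementation-specific setting that the JDK's default transformer honours but other implementations may ignore. How much of the trailing, unfinished markup reaches the output depends on the parser and serializer, so treat the result as best effort.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;

// Hypothetical helper: indents XML that may be cut off at an arbitrary position.
class TruncatedXmlIndenter {

    static String indent(String xml) {
        Transformer identity;
        try {
            // newTransformer() without a stylesheet yields the identity transformer.
            identity = TransformerFactory.newInstance().newTransformer();
        } catch (TransformerConfigurationException e) {
            throw new IllegalStateException("No JAXP transformer available", e);
        }
        identity.setOutputProperty(OutputKeys.INDENT, "yes");
        // Assumption: implementation-specific property understood by the JDK's default transformer.
        identity.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");

        StringWriter out = new StringWriter();
        try {
            // The SAX parser streams events straight into the indenting serializer.
            identity.transform(
                    new SAXSource(new InputSource(new StringReader(xml))),
                    new StreamResult(out));
        } catch (TransformerException e) {
            // Expected for truncated input: keep whatever was serialized so far.
            System.err.println("Input ended early: " + e.getMessage());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String truncated = "<root><section><p>Paragraph 1.</p><p>Paragraph 2.";
        System.out.println(indent(truncated));
    }
}
```

For large inputs, a file-based Reader and a buffered Writer could replace the strings; in that case, flush and close the writer in a finally block so the partial output is not lost.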

I think you can probably do the same thing in C#, but not quite as conveniently: reading events from an XmlReader and sending the corresponding events to an XmlWriter is a lot more tedious, because you have to write code for each separate kind of event.

If you want a C# solution and you're prepared to install Saxon Enterprise Edition, you can write a simple streaming transformation:

<transform version="3.0" xmlns="http://www.w3.org/1999/XSL/Transform">
  <output method="xml" indent="yes"/>
  <mode streamable="yes" on-no-match="shallow-copy"/>
</transform>

Invoke it from the Saxon API using an XsltTransformer with a Serializer as the destination and, again, catch the exception and flush/close the output stream to which the Serializer is writing.

Using Saxon on Java would be overkill because the identity transformer does this "out of the box".

The error-tolerant XML parser of AngleSharp.Xml can be used to parse your sample, although it adds the missing tags. You can then get an XML string representation of the built document and, with the help of the legacy XmlTextReader and XmlTextWriter, which allow you to ignore namespaces, at least indent the markup:

        // Requires the AngleSharp.Xml NuGet package.
        var xml = @"<?xml encoding=""UTF-8""?><X><B><C>aa</C><B/><A.B><X>bb</X></A.B><A p=""pp""/><nn:A>cc</nn:A><D><E>eee</";

        // Parse leniently; the parser adds the missing end tags.
        var xmlParser = new XmlParser(new XmlParserOptions() { IsSuppressingErrors = true });
        var doc = xmlParser.ParseDocument(xml);

        Console.WriteLine(doc.ToMarkup());

        // Re-read the serialized document and indent it, ignoring namespaces
        // (the sample uses the undeclared prefix nn).
        using (StringReader sr = new StringReader(doc.ToXml()))
        using (XmlTextReader xr = new XmlTextReader(sr))
        using (XmlTextWriter xw = new XmlTextWriter(Console.Out))
        {
            xr.Namespaces = false;
            xw.Namespaces = false;
            xw.Formatting = Formatting.Indented;

            xw.WriteNode(xr, false);
        }

This gives, for example:

<X>
  <B>
    <C>aa</C>
    <B />
    <A.B>
      <X>bb</X>
    </A.B>
    <A p="pp" />
    <nn:A>cc</nn:A>
    <D>
      <E>eee</E>
    </D>
  </B>
</X>

Your text says "Until this position the XML has a valid syntax", and your comment suggests the errors in your sample are just due to sloppiness. In that case I think it might also be possible to call WriteNode on an XmlWriter with XmlWriterSettings.Indent set to true, passing it a standard XmlReader, as long as you catch the exception the XmlReader throws:

        var xml = @"<?xml version=""1.0""?><root><section><p>Paragraph 1.</p><p>Paragraph 2.";

        try
        {
            using (StringReader sr = new StringReader(xml))
            {
                using (XmlReader xr = XmlReader.Create(sr))
                {
                    using (XmlWriter xw = XmlWriter.Create(Console.Out, new XmlWriterSettings() { Indent = true }))
                    {
                        xw.WriteNode(xr, false);
                    }
                }
            }
        }
        catch (XmlException e)
        {
            Console.WriteLine();
            Console.WriteLine("Malformed input XML: {0}", e.Message);
        }

gives

<?xml version="1.0"?>
<root>
  <section>
    <p>Paragraph 1.</p>
    <p>Paragraph 2.</p>
  </section>
</root>
Malformed input XML: Unexpected end of file has occurred. The following elements are not closed: p, section, root. Line 1, position 71.

So with WriteNode there is no need to handle every possible Readxxx method and node type and call the corresponding Writexxx method on the XmlWriter in your own code.
