Using C#, how do I close malformed XML tags?

Background

I have inherited a load of XML files that consistently contain a tag that is opened twice instead of being opened and then closed. I need to loop through all of these files and correct the malformed XML.

Here is a simplified example of the bad XML. The affected tag is exactly the same in every file:

<meals>
    <breakfast>
         Eggs and Toast
    </breakfast>
    <lunch>
         Salad and soup
    <lunch>
    <supper>
         Roast beef and potatoes
    </supper>
</meals>

Notice that the <lunch> tag is never closed: the second <lunch> should be </lunch>. This is consistent across all of the files.

Question

Would it be best to use a regex in C# to fix this and, if so, how exactly would I do that?

I already know how to iterate the file system and read the docs into either an XML or string object, so you don't need to answer that part.

Thanks!

If your broken XML is relatively simple, as you've shown in the question, then you can get away with some simplistic logic and a basic regular expression.

    using System;
    using System.Text.RegularExpressions;

    public static class Program
    {
        public static void Main(string[] args)
        {
            string broken = @"
<meals>
    <breakfast>
         Eggs and Toast
    </breakfast>
    <lunch>
         Salad and soup
    <lunch>
    <supper>
         Roast beef and potatoes
    </supper>
</meals>";

            // "open" captures the whole opening tag, "tag" just its name;
            // \k<open> requires the identical opening tag to appear a second time.
            var pattern1 = "(?<open><(?<tag>[a-z]+)>)([^<]+?)(\\k<open>)";
            var re1 = new Regex(pattern1, RegexOptions.Singleline);

            string work = broken;
            Match match = re1.Match(work);
            while (match.Success)
            {
                Console.WriteLine("Match at position {0}.", match.Index);
                var tag = match.Groups["tag"].Value;
                Console.WriteLine("tag: {0}", tag);

                // The match ends with "<tag>"; insert the slash right after that '<'.
                int slashPos = match.Index + match.Length - tag.Length - 1;
                work = work.Insert(slashPos, "/");

                Console.WriteLine("fixed: {0}", work);

                // Look for another doubled tag in the corrected text.
                match = re1.Match(work);
            }
        }
    }

That regex uses the named capture group feature of .NET regular expressions. The ?<open> indicates that the group captured by the enclosing parentheses will be accessible by the name "open". That group captures the opening tag, including angle brackets. It presumes there is no XML attribute on the opening tag. Within that group there is another named group, "tag", which captures the tag name itself, without angle brackets.

The regex then lazily captures the intervening text ( ([^<]+?) ), and then another "open" tag, which is specified with a back-reference ( \k<open> ). The lazy quantifier, together with the [^<] character class, keeps the match from running past any intervening tag in the text.

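Because the XML may span multiple lines, the pattern has to match across newlines. The negated class [^<] already does; RegexOptions.Singleline is what lets a dot match a newline, so it matters if you change the middle group to something like (.+?).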

The logic then applies this regex in a loop, replacing each match with a corrected version: valid XML with a closing tag. The corrected XML is produced with simple string manipulation, inserting a slash into the second tag.

This regex won't work if:

  • there are XML attributes on the opening tag
  • there is unusual spacing, such as whitespace inside the angle brackets around a tag name
  • the tag names use dashes, digits, or anything other than lowercase ASCII letters
  • the text between the tags includes angle brackets (for example, in CDATA)

...but the approach will still work. You would just need to tweak things a little.
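
As one possible tweak, here is a sketch of a more tolerant pattern. It is an assumption about what "tweaking" could look like, not part of the original answer: the names TagFixer, CloseDoubledTags, attrs and body are made up for illustration. It allows attributes on the first tag, whitespace inside the brackets, and tag names with digits, dashes, dots or colons, and it matches the repeated tag by name rather than by its full opening text. It still does not handle angle brackets in the text between the tags (CDATA).

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagFixer
{
    // Hypothetical, more tolerant variant of the pattern discussed above:
    //  - tag: a name made of letters, digits, '_', '-', '.' or ':'
    //  - attrs: optional whitespace/attributes on the first tag
    //  - (?<!/)> : the first tag must not be self-closing ("<lunch />")
    //  - body: text with no '<' in it
    //  - \k<tag>: the same tag name, opened again without attributes
    private static readonly Regex DoubledOpen = new Regex(
        @"<\s*(?<tag>[A-Za-z_][\w.:-]*)(?<attrs>\s[^<>]*)?(?<!/)>(?<body>[^<]+)<\s*\k<tag>\s*>");

    public static string CloseDoubledTags(string text)
    {
        // Rebuild each match with the second tag turned into a closing tag.
        return DoubledOpen.Replace(text, m =>
            "<" + m.Groups["tag"].Value + m.Groups["attrs"].Value + ">" +
            m.Groups["body"].Value +
            "</" + m.Groups["tag"].Value + ">");
    }
}
```

Like the original pattern, this treats any "open, text, same open" sequence as an error, so a document that legitimately nests an element inside an element of the same name after some text would be changed incorrectly.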

I think regular expressions would be a bit of overkill if the situation is truly as simple as you describe it (i.e., it's always the same tag, and there's always only one of them). If your XML files are relatively small (kilobytes, not megabytes), you can just load the whole thing into memory, use string operations to insert the missing slash, and call it a day. This should be considerably more efficient (faster) than trying to use regular expressions. If your files are very large, you can modify it to read the file line by line until it finds the first <lunch> tag, then look for the next one and modify it accordingly. Here's some code to get you started:

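using System;
using System.IO;

var xml = File.ReadAllText( @"C:\Path\To\NaughtyXml.xml" );

// Find the first <lunch>, then the second one, which should have been </lunch>.
var firstLunchIdx = xml.IndexOf( "<lunch>", StringComparison.Ordinal );
var secondLunchIdx = firstLunchIdx < 0
    ? -1
    : xml.IndexOf( "<lunch>", firstLunchIdx + 1, StringComparison.Ordinal );

if ( secondLunchIdx >= 0 )
{
    // Insert the missing slash right after the '<' of the second tag.
    var correctedXml = xml.Insert( secondLunchIdx + 1, "/" );
    File.WriteAllText( @"C:\Path\To\CorrectedXml.xml", correctedXml );
}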
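
For very large files, the line-by-line variant described above might look roughly like this. It is a sketch under assumptions: LunchFixer and FixSecondOpenTag are made-up names, the tag is assumed to carry no attributes, and the method works on a TextReader/TextWriter pair so it can be tried on strings as well as files.

```csharp
using System;
using System.IO;

public static class LunchFixer
{
    // Copies input to output one line at a time. The second occurrence of
    // "<tagName>" is rewritten as "</tagName>"; everything else is passed through.
    public static void FixSecondOpenTag(TextReader input, TextWriter output, string tagName = "lunch")
    {
        string openTag = "<" + tagName + ">";
        int seen = 0; // how many opening tags have been found so far
        string line;
        while ((line = input.ReadLine()) != null)
        {
            int from = 0;
            int idx;
            // Scan the line in case both tags sit on the same line.
            while (seen < 2 && (idx = line.IndexOf(openTag, from, StringComparison.Ordinal)) >= 0)
            {
                seen++;
                if (seen == 2)
                {
                    line = line.Insert(idx + 1, "/");
                }
                from = idx + openTag.Length;
            }
            output.WriteLine(line);
        }
    }
}
```

To use it on files, wrap a StreamReader and a StreamWriter in using blocks and pass them in. Note that WriteLine writes the writer's own newline sequence after every line, so line endings may be normalized and the output always ends with a newline.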

If the only issue in your XML files is the one you have shown, then Chesso's answer should be sufficient. In fact, I would go that route even if it covered only 80-90% of my needs; the remaining cases I could handle manually or with specific handling code.

That said, if the file structure is complicated and not as simple as you describe, then you should probably look at a text lexer that lets you break the file content into tokens. You still have to do the semantic analysis of the tokens yourself to check and correct irregularities, but at least parsing the text becomes much simpler. Here are a few resources on lexing in C#:

  1. http://blogs.msdn.com/b/drew/archive/2009/12/31/a-simple-lexer-in-c-that-uses-regular-expressions.aspx
  2. Poor man's "lexer" for C#
  3. http://www.seclab.tuwien.ac.at/projects/cuplex/lex.htm
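
To give an idea of the token-based approach, here is a minimal sketch that does not use any of the libraries linked above. The names TokenRepair and Repair are hypothetical, and the rule it applies is an assumption based on the example in the question: split the text into tags and text runs, keep a stack of open element names, and treat an attribute-free opening tag as a closing tag when it repeats the element that was just opened with only (non-blank) text in between.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class TokenRepair
{
    // A token is a bracketed tag, a run of text, or a stray '<' (kept so no input is lost).
    private static readonly Regex Token = new Regex(@"<[^<>]*>|[^<]+|<");
    private static readonly Regex OpenTag = new Regex(@"^<([A-Za-z_][\w.:-]*)>$");
    private static readonly Regex CloseTag = new Regex(@"^</([A-Za-z_][\w.:-]*)\s*>$");

    public static string Repair(string text)
    {
        var output = new StringBuilder();
        var open = new Stack<string>(); // names of elements currently open
        bool lastTagWasOpen = false;    // the most recent tag was an opening tag
        bool sawText = false;           // non-blank text followed that opening tag

        foreach (Match m in Token.Matches(text))
        {
            string token = m.Value;
            if (token[0] != '<')
            {
                if (token.Trim().Length > 0) sawText = true;
            }
            else if (OpenTag.IsMatch(token))
            {
                string name = OpenTag.Match(token).Groups[1].Value;
                if (lastTagWasOpen && sawText && open.Count > 0 && open.Peek() == name)
                {
                    // Same element opened twice around plain text: emit a closing tag instead.
                    token = "</" + name + ">";
                    open.Pop();
                    lastTagWasOpen = false;
                }
                else
                {
                    open.Push(name);
                    lastTagWasOpen = true;
                    sawText = false;
                }
            }
            else
            {
                // Closing tag, or other markup (comment, declaration, tag with attributes).
                Match c = CloseTag.Match(token);
                if (c.Success && open.Count > 0 && open.Peek() == c.Groups[1].Value) open.Pop();
                lastTagWasOpen = false;
            }
            output.Append(token);
        }
        return output.ToString();
    }
}
```

Opening tags with attributes are passed through untouched and not tracked, and mixed content such as text followed by a genuinely nested element of the same name would be misread, so the decision logic would need to be adapted to the real files.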

It's best to avoid thinking of these as XML files: they are non-XML files. This immediately tells you that tools designed for processing XML will be of no use, because the input is not XML. You need to use text-based tools. On UNIX this would be things like sed/awk/perl; I've no idea what the equivalent would be on Windows.
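
This is easy to confirm in C#: an XML parser rejects the broken input outright. The same parser is still handy afterwards, to check that a text-level fix actually produced well-formed XML. A small sketch follows; XmlCheck and IsWellFormed are made-up names, while XDocument.Parse and XmlException are the standard .NET APIs.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlCheck
{
    // Returns true if the text parses as well-formed XML, false otherwise.
    public static bool IsWellFormed(string text)
    {
        try
        {
            XDocument.Parse(text);
            return true;
        }
        catch (XmlException)
        {
            // Thrown for malformed input, e.g. a tag that is never closed.
            return false;
        }
    }
}
```

Running this over each file before and after the correction gives a cheap sanity check that the fix did what was intended.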
