
Remove empty XML tags

I am looking for a good approach that can remove empty tags from XML efficiently. What do you recommend? Regex? XDocument? XmlTextReader?

For example,

const string original = 
    @"<?xml version=""1.0"" encoding=""utf-16""?>
    <pet>
        <cat>Tom</cat>
        <pig />
        <dog>Puppy</dog>
        <snake></snake>
        <elephant>
            <africanElephant></africanElephant>
            <asianElephant>Biggy</asianElephant>
        </elephant>
        <tiger>
            <tigerWoods></tigerWoods>       
            <americanTiger></americanTiger>
        </tiger>
    </pet>";

Could become:

const string expected = 
    @"<?xml version=""1.0"" encoding=""utf-16""?>
    <pet>
        <cat>Tom</cat>
        <dog>Puppy</dog>
        <elephant>
            <asianElephant>Biggy</asianElephant>
        </elephant>
    </pet>";

Loading your original into an XDocument and using the following code gives your desired output:

var document = XDocument.Parse(original);
document.Descendants()
        .Where(e => e.IsEmpty || String.IsNullOrWhiteSpace(e.Value))
        .Remove();
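For readers who want to run the snippet above as-is, here is a minimal self-contained sketch. The helper name `RemoveEmptyElements` and the shortened sample input are assumptions made for illustration; the predicate is the one from the answer. It is written with C# top-level statements.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper name; the Where(...) predicate is taken from the answer above.
static string RemoveEmptyElements(string xml)
{
    var document = XDocument.Parse(xml);

    // The Remove() extension snapshots the query first, so removing
    // elements while the query is being enumerated is safe.
    document.Descendants()
            .Where(e => e.IsEmpty || string.IsNullOrWhiteSpace(e.Value))
            .Remove();

    // DisableFormatting keeps the output on one line for easy comparison.
    return document.ToString(SaveOptions.DisableFormatting);
}

Console.WriteLine(RemoveEmptyElements(
    "<pet><cat>Tom</cat><pig /><tiger><tigerWoods></tigerWoods></tiger></pet>"));
// prints: <pet><cat>Tom</cat></pet>
```

Note that this predicate looks only at text content, so an element such as `<pig name="x" />` would be removed as well.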

This is meant to be an improvement on the accepted answer to handle attributes:

XDocument xd = XDocument.Parse(original);
xd.Descendants()
    .Where(e => (e.Attributes().All(a => a.IsNamespaceDeclaration || string.IsNullOrWhiteSpace(a.Value))
            && string.IsNullOrWhiteSpace(e.Value)
            && e.Descendants().SelectMany(c => c.Attributes()).All(ca => ca.IsNamespaceDeclaration || string.IsNullOrWhiteSpace(ca.Value))))
    .Remove();

The idea here is to check that all attributes on an element are also empty before removing it. Empty descendants can also have non-empty attributes, so I added a third condition that checks that every attribute among the element's descendants is empty as well. Consider the following document, with node8 added:

<root>
  <node />
  <node2 blah='' adf='2'></node2>
  <node3>
    <child />
  </node3>
  <node4></node4>
  <node5><![CDATA[asdfasdf]]></node5>
  <node6 xmlns='urn://blah' d='a'/>
  <node7 xmlns='urn://blah2' />
  <node8>
     <child2 d='a' />
  </node8>
</root>

This would become:

<root>
  <node2 blah="" adf="2"></node2>
  <node5><![CDATA[asdfasdf]]></node5>
  <node6 xmlns="urn://blah" d="a" />
  <node8>
    <child2 d="a" />
  </node8>
</root>

The original and improved answers to this question would lose the node2, node6 and node8 nodes. Checking for e.IsEmpty would work if you only want to strip out nodes like <node />, but it's redundant if you're going for both <node /> and <node></node>. If you also need to remove empty attributes, you could do this:

xd.Descendants().Attributes().Where(a => string.IsNullOrWhiteSpace(a.Value)).Remove();
xd.Descendants()
  .Where(e => (e.Attributes().All(a => a.IsNamespaceDeclaration))
            && string.IsNullOrWhiteSpace(e.Value))
  .Remove();

which would give you:

<root>
  <node2 adf="2"></node2>
  <node5><![CDATA[asdfasdf]]></node5>
  <node6 xmlns="urn://blah" d="a" />
</root>

As always, it depends on your requirements.

Do you know how the empty tags will appear (e.g. <pig />, <pig></pig>, etc.)? I usually do not recommend using regular expressions (they are really useful, but at the same time they are evil). A string.Replace approach also seems problematic unless your XML has a known, fixed structure.

Finally, I would recommend using an XML parser approach (make sure your input is valid XML).

var doc = XDocument.Parse(original);
var emptyElements = from descendant in doc.Descendants()
                    where descendant.IsEmpty || string.IsNullOrWhiteSpace(descendant.Value)
                    select descendant;
emptyElements.Remove();

XmlTextReader is preferable if we are talking about performance (it provides fast, forward-only access to XML). You can determine whether a tag is empty using the XmlReader.IsEmptyElement property. Note that IsEmptyElement is true only for the self-closing form (<pig />), not for <pig></pig>.
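A small sketch of how IsEmptyElement behaves for the two spellings of an empty tag. The helper name `ScanEmptyFlags` is hypothetical, and the sketch uses `XmlReader.Create` rather than constructing an XmlTextReader directly, which is an assumption about what a current codebase would prefer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper: records IsEmptyElement for each element name in the input.
static Dictionary<string, bool> ScanEmptyFlags(string xml)
{
    var flags = new Dictionary<string, bool>();
    using (var reader = XmlReader.Create(new StringReader(xml)))
    {
        while (reader.Read())
        {
            // IsEmptyElement is only meaningful while positioned on a start tag.
            if (reader.NodeType == XmlNodeType.Element)
                flags[reader.Name] = reader.IsEmptyElement;
        }
    }
    return flags;
}

foreach (var pair in ScanEmptyFlags("<pet><pig /><snake></snake></pet>"))
    Console.WriteLine($"{pair.Key}: IsEmptyElement = {pair.Value}");
```

Here pig is reported as empty while snake is not, so a reader-based solution still has to watch for a start tag that is followed directly by its end tag.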

XDocument approach which produces the desired output:

// An element counts as empty if it is self-closing, or if it has no text
// and every child element (if any) is itself empty.
public static bool IsEmpty(XElement n)
{
    return n.IsEmpty 
        || (string.IsNullOrEmpty(n.Value) 
            && (!n.HasElements || n.Elements().All(IsEmpty)));
}

var doc = XDocument.Parse(original);
var emptyNodes = doc.Descendants().Where(IsEmpty);
// ToArray() takes a snapshot so nodes can be removed while iterating.
foreach (var emptyNode in emptyNodes.ToArray())
{
    emptyNode.Remove();
}

Anything you use will have to pass through the file at least once. If it's just a single, known tag name, then regex is your friend; otherwise use a stack approach. Start with the parent tag and, if it has a sub tag, push it onto the stack. If you find an empty tag, remove it. Once you have gone through the child tags and reached the end tag of the element on top of the stack, pop it and check it as well; if it is empty, remove it too. This way you can remove all empty tags, including tags that contain only empty children.
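The stack idea can be sketched in a single forward pass with XmlReader and XmlWriter. This is one possible reading of the description, not the answerer's code: the helper name `StripEmptyElements` is hypothetical, start tags are held back until real text shows up beneath them, and attributes, comments and processing instructions are ignored for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper: copies XML to a string, skipping elements that contain no text.
// Assumption: attributes, comments and processing instructions are dropped for brevity.
static string StripEmptyElements(string xml)
{
    var output = new StringBuilder();
    var writerSettings = new XmlWriterSettings { OmitXmlDeclaration = true };
    var readerSettings = new XmlReaderSettings { IgnoreWhitespace = true };

    using (var reader = XmlReader.Create(new StringReader(xml), readerSettings))
    using (var writer = XmlWriter.Create(output, writerSettings))
    {
        // Stack of start tags that have been read but not written yet.
        var pending = new List<(string Prefix, string Name, string Ns)>();

        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    // <pig /> has no content and no end tag, so it is dropped at once.
                    if (!reader.IsEmptyElement)
                        pending.Add((reader.Prefix, reader.LocalName, reader.NamespaceURI));
                    break;

                case XmlNodeType.Text:
                case XmlNodeType.CDATA:
                    if (string.IsNullOrWhiteSpace(reader.Value)) break;
                    // Real content: every pending ancestor is now known to be non-empty.
                    foreach (var tag in pending)
                        writer.WriteStartElement(tag.Prefix, tag.Name, tag.Ns);
                    pending.Clear();
                    writer.WriteString(reader.Value); // CDATA is re-emitted as escaped text
                    break;

                case XmlNodeType.EndElement:
                    if (pending.Count > 0)
                        pending.RemoveAt(pending.Count - 1); // never written, so it was empty
                    else
                        writer.WriteEndElement();
                    break;
            }
        }
    }
    return output.ToString();
}

Console.WriteLine(StripEmptyElements(
    "<pet><cat>Tom</cat><pig /><snake></snake><tiger><tigerWoods></tigerWoods></tiger></pet>"));
// prints: <pet><cat>Tom</cat></pet>
```

Because nothing is written until content appears, the whole document is never held in memory; only the chain of currently open, still-empty elements is buffered.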

If you are after a regular expression use this

XDocument is probably simplest to implement, and will give adequate performance if you know your documents are reasonably small.

XmlTextReader will be faster and use less memory than XDocument when processing very large documents.

Regex is best for handling text rather than XML. It might not handle all edge cases as you would like (e.g. a tag within a CDATA section; a tag with an xmlns attribute), so it is probably not a good idea for a general implementation, but may be adequate depending on how much control you have over the input XML.
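To make the trade-off concrete, here is a hedged sketch of what a regex pass might look like for tightly controlled input. The pattern and the helper name `StripEmptyTags` are illustrative assumptions, not taken from any answer above. It only matches attribute-free tags whose names consist of word characters, and it will misfire on tag-like text inside CDATA sections or comments.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: repeatedly deletes <name/> and <name></name> until nothing changes.
// Assumption: tags have no attributes and names contain only word characters.
static string StripEmptyTags(string xml)
{
    var emptyTag = new Regex(@"<\w+\s*/>|<(\w+)>\s*</\1\s*>");
    string previous;
    do
    {
        previous = xml;
        // Each pass can expose new empty parents, hence the loop.
        xml = emptyTag.Replace(xml, string.Empty);
    } while (xml != previous);
    return xml;
}

Console.WriteLine(StripEmptyTags(
    "<pet><cat>Tom</cat><pig /><tiger><tigerWoods></tigerWoods></tiger></pet>"));
// prints: <pet><cat>Tom</cat></pet>
```

The loop is needed because removing tigerWoods only makes tiger empty on the next pass, which is one reason a parser-based approach tends to be simpler to reason about.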
