
Remove all CDATA nodes and replace with encoded text

So, I've got a massive XML file and I want to remove all CDATA sections and replace the CDATA node contents with safe, HTML-encoded text nodes.

Just stripping out the CDATA with a regex will of course break the parsing. Is there a LINQ or XmlDocument or XmlTextWriter technique to swap out the CDATA with encoded text?

I'm not too concerned with the final encoding quite yet, just how to replace the sections with the encoding of my choice.

Original Example

  ---
  <COLLECTION type="presentation" autoplay="false">
    <TITLE><![CDATA[Rights & Responsibilities]]></TITLE>
    <ITEM id="2802725d-dbac-e011-bcd6-005056af18ff" presenterGender="male">
      <TITLE><![CDATA[Watch the demo]]></TITLE>
      <LINK><![CDATA[_assets/2302725d-dbac-e011-bcd6-005056af18ff/presentation/presentation-00000000.mp4]]></LINK>
    </ITEM>
  </COLLECTION>
  ---

Should Become

          <COLLECTION type="presentation" autoplay="false">
            <TITLE>Rights &amp; Responsibilities</TITLE>
            <ITEM id="2802725d-dbac-e011-bcd6-005056af18ff" presenterGender="male">
              <TITLE>Watch the demo</TITLE>
              <LINK>_assets/2302725d-dbac-e011-bcd6-005056af18ff/presentation/presentation-00000000.mp4</LINK>
            </ITEM>
          </COLLECTION>

I guess the ultimate goal is to move to JSON. I've tried this:

            XmlDocument doc = new XmlDocument();
            doc.Load(Server.MapPath( @"~/somefile.xml"));
            string jsonText = JsonConvert.SerializeXmlNode(doc);

But I end up with ugly nodes, i.e. "#cdata-section" keys. It would take way too many hours to have the front end redeveloped to accept this.

"COLLECTION":[{"@type":"whitepaper","TITLE":{"#cdata-section":"SUPPORTING DOCUMENTS"}},{"@type":"presentation","@autoplay":"false","TITLE":{"#cdata-section":"Demo Presentation"},"ITEM":{"@id":"2802725d-dbac-e011-bcd6-005056af18ff","@presenterGender":"male","TITLE":{"#cdata-section":"Watch the demo"},"LINK":{"#cdata-section":"_assets/2302725d-dbac-e011-bcd6-005056af18ff/presentation/presentation-00000000.mp4"}

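Process the XML with an XSLT identity transform that just copies input to output. The XSLT data model treats CDATA sections as ordinary text, so the serializer writes them back out as escaped text nodes. C# code: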

  XslCompiledTransform transform = new XslCompiledTransform();
  transform.Load(@"c:\temp\id.xslt");
  transform.Transform(@"c:\temp\cdata.xml", @"c:\temp\clean.xml");

id.xslt:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

Using LINQ to XML, you can do it like this:

XDocument doc = …;

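// Snapshot the CDATA nodes first so the tree can be modified while looping.
var cDataNodes = doc.DescendantNodes().OfType<XCData>().ToArray();

// XCData derives from XText, so this copies the value into a plain text node.
foreach (var cDataNode in cDataNodes)
    cDataNode.ReplaceWith(new XText(cDataNode));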
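
For anyone who wants to try the LINQ to XML approach end to end, here is a minimal self-contained sketch. The class name `CDataToTextDemo`, the helper `ReplaceCData`, and the inline sample string are assumptions made for illustration (the sample is trimmed from the XML in the question); in practice you would load your own file with `XDocument.Load`.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class CDataToTextDemo
{
    // Hypothetical helper: replaces every CDATA node in the tree
    // with an ordinary text node holding the same value.
    public static void ReplaceCData(XDocument doc)
    {
        // ToArray() snapshots the matches so the tree can be modified while looping.
        var cDataNodes = doc.DescendantNodes().OfType<XCData>().ToArray();

        foreach (var cDataNode in cDataNodes)
            cDataNode.ReplaceWith(new XText(cDataNode.Value));
    }

    public static void Main()
    {
        // Inline sample, trimmed from the question's XML (an assumption for the demo).
        var doc = XDocument.Parse(
            "<COLLECTION><TITLE><![CDATA[Rights & Responsibilities]]></TITLE></COLLECTION>");

        ReplaceCData(doc);

        // The ampersand is escaped when the text node is serialized.
        Console.WriteLine(doc.ToString(SaveOptions.DisableFormatting));
    }
}
```

This should print `<COLLECTION><TITLE>Rights &amp; Responsibilities</TITLE></COLLECTION>`: the escaping happens at serialization time, so there is no need to encode the text by hand.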

I think you can load the XML into an XmlDocument. Then recursively process each XmlNode and look for XmlCDataSection nodes. Each XmlCDataSection node should be replaced with an XmlText node that has the same value.
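
The recursive XmlDocument approach described above could look roughly like the following. This is only a sketch: the class name `XmlDocumentCDataDemo`, the helper `ReplaceCData`, and the inline sample are assumptions for illustration, not code from the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XmlDocumentCDataDemo
{
    // Hypothetical helper: walks the tree below 'node' and swaps each
    // XmlCDataSection for an XmlText node with the same value.
    public static void ReplaceCData(XmlNode node)
    {
        // The document node has no OwnerDocument, so fall back to the node itself.
        XmlDocument owner = node as XmlDocument ?? node.OwnerDocument;

        // Copy the children first: ReplaceChild changes the live ChildNodes list.
        var children = new List<XmlNode>();
        foreach (XmlNode child in node.ChildNodes)
            children.Add(child);

        foreach (XmlNode child in children)
        {
            if (child is XmlCDataSection cdata)
                node.ReplaceChild(owner.CreateTextNode(cdata.Value), cdata);
            else
                ReplaceCData(child); // recurse into elements; leaf nodes have no children
        }
    }

    public static void Main()
    {
        // Inline sample, trimmed from the question's XML (an assumption for the demo).
        var doc = new XmlDocument();
        doc.LoadXml("<ITEM><TITLE><![CDATA[Watch the demo]]></TITLE></ITEM>");

        ReplaceCData(doc);

        Console.WriteLine(doc.OuterXml);
    }
}
```

After the call, the document holds only regular text nodes, so passing it to `JsonConvert.SerializeXmlNode(doc)` as in the question should no longer produce the "#cdata-section" keys.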

