Remove all CDATA nodes and replace with encoded text

Question

So, I've got a massive XML file, and I want to remove all CDATA sections and replace their contents with safe, HTML-encoded text nodes.

Just stripping out the CDATA with a regex will of course break the parsing. Is there a LINQ or XmlDocument or XmlTextWriter technique to swap out the CDATA with encoded text?

I'm not too concerned with the final encoding quite yet, just how to replace the sections with the encoding of my choice.

Original Example

  <COLLECTION type="presentation" autoplay="false">
    <TITLE><![CDATA[Rights & Responsibilities]]></TITLE>
    <ITEM id="2802725d-dbac-e011-bcd6-005056af18ff" presenterGender="male">
      <TITLE><![CDATA[Watch the demo]]></TITLE>
      <LINK><![CDATA[_assets/2302725d-dbac-e011-bcd6-005056af18ff/presentation/presentation-00000000.mp4]]></LINK>
    </ITEM>
  </COLLECTION>

Should Become

          <COLLECTION type="presentation" autoplay="false">
            <TITLE>Rights &amp; Responsibilities</TITLE>
            <ITEM id="2802725d-dbac-e011-bcd6-005056af18ff" presenterGender="male">
              <TITLE>Watch the demo</TITLE>
              <LINK>_assets/2302725d-dbac-e011-bcd6-005056af18ff/presentation/presentation-00000000.mp4</LINK>
            </ITEM>
          </COLLECTION>

I guess the ultimate goal is to move to JSON. I've tried this:

            XmlDocument doc = new XmlDocument();
            doc.Load(Server.MapPath( @"~/somefile.xml"));
            string jsonText = JsonConvert.SerializeXmlNode(doc);

But I end up with ugly nodes, i.e. "#cdata-section" keys. It would take way too many hours to have the front end redeveloped to accept this.

"COLLECTION":[{"@type":"whitepaper","TITLE":{"#cdata-section":"SUPPORTING DOCUMENTS"}},{"@type":"presentation","@autoplay":"false","TITLE":{"#cdata-section":"Demo Presentation"},"ITEM":{"@id":"2802725d-dbac-e011-bcd6-005056af18ff","@presenterGender":"male","TITLE":{"#cdata-section":"Watch the demo"},"LINK":{"#cdata-section":"_assets/2302725d-dbac-e011-bcd6-005056af18ff/presentation/presentation-00000000.mp4"}

Answer 1

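Process the XML with an XSLT identity transform, which just copies input to output. The XSLT data model does not distinguish CDATA sections from ordinary text, so the CDATA contents are written out as escaped text nodes. C# code: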

  XslCompiledTransform transform = new XslCompiledTransform();
  transform.Load(@"c:\temp\id.xslt");
  transform.Transform(@"c:\temp\cdata.xml", @"c:\temp\clean.xml");

id.xslt:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

Answer 2

Using LINQ to XML, you can do it like this:

XDocument doc = …;

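// Materialize the list first so that replacing nodes does not disturb the enumeration.
var cDataNodes = doc.DescendantNodes().OfType<XCData>().ToArray();

// XCData derives from XText, so this uses the XText copy constructor and keeps the same value.
foreach (var cDataNode in cDataNodes)
    cDataNode.ReplaceWith(new XText(cDataNode));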

Answer 3

I think you can load the XML into an XmlDocument, then recursively process each XmlNode and look for XmlCDataSection nodes. Each XmlCDataSection node should be replaced with an XmlText node (created with XmlDocument.CreateTextNode) that has the same value.
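The recursive walk described above could look roughly like the following. This is only a sketch, not tested against the original file; the class and method names (CDataCleaner, ReplaceCDataWithText) are made up for illustration.

```csharp
using System.Linq;
using System.Xml;

// Hypothetical helper; the names are illustrative, not part of any library.
public static class CDataCleaner
{
    // Walks the tree below 'node' and swaps every CDATA section for a
    // plain text node with the same value. The escaping (& to &amp; etc.)
    // happens when the document is serialized.
    public static void ReplaceCDataWithText(XmlNode node)
    {
        // Copy the children to an array so that replacing a child
        // does not disturb the enumeration of the live ChildNodes list.
        foreach (XmlNode child in node.ChildNodes.Cast<XmlNode>().ToArray())
        {
            var cdata = child as XmlCDataSection;
            if (cdata != null)
            {
                XmlText text = cdata.OwnerDocument.CreateTextNode(cdata.Value);
                node.ReplaceChild(text, cdata);
            }
            else
            {
                ReplaceCDataWithText(child);
            }
        }
    }
}
```

To use it, load the file with doc.Load(...), call CDataCleaner.ReplaceCDataWithText(doc), and then save the document or pass it on to JsonConvert.SerializeXmlNode(doc) as in the question. If a CDATA section sits right next to an ordinary text node, the result is two adjacent text nodes; calling doc.Normalize() afterwards should merge them.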

3 answers

solution1
5 ACCEPTED 2012-05-10 23:38:37

solution2
4 2012-05-10 23:56:44

solution3
0 2012-05-10 23:47:47
