
Remove CDATA from the input

I receive a string that contains CDATA sections, and I want to remove them.

Input : "<Text><![CDATA[Hello]]></Text><Text><![CDATA[World]]></Text>"
Output I want : <text>Hello</text> 
              <text>World</text>

I want to take all data between <text> and </text> and add it to a list.

The code I tried is:

private List<XElement> Foo(string input)
{
    string pattern = "<text>(.*?)</text>";
    input = "<Text><![CDATA[Hello]]></Text><Text><![CDATA[World]]></Text>"; // For testing
    var matches = Regex.Matches(input, pattern, RegexOptions.IgnoreCase);
    var a = matches.Cast<Match>().Select(m => m.Groups[1].Value.Trim()).ToArray();

    List<XElement> li = new List<XElement>();
    XElement xText;
    for (int i = 0; i < a.Length; i++)
    {
        xText = new XElement("text");
        xText.Add(System.Net.WebUtility.HtmlDecode(a[i]));
        li.Add(xText);
    }
    return li;
} 

But here I get this output:

<text>&lt;![CDATA[Hello]]&gt;</text>
<text>&lt;![CDATA[World]]&gt;</text>

Can anyone please help me?

It seems to me that you shouldn't be using a regular expression at all. Instead, construct a valid XML document by wrapping it all in a root element, then parse it and extract the elements you want.

You also want to replace all CDATA nodes with their equivalent text nodes. You can do that before or after you extract the elements into a list, but I've chosen to do it before:

using System;
using System.Linq;
using System.Xml.Linq;

class Test
{
    static void Main()
    {
        string input = "<Text><![CDATA[Hello]]></Text><Text><![CDATA[World]]></Text>";
        string xml = "<root>" + input + "</root>";
        var doc = XDocument.Parse(xml);
        var nodes = doc.DescendantNodes().OfType<XCData>().ToList();
        foreach (var node in nodes)
        {
            node.ReplaceWith(new XText(node.Value));
        }
        var elements = doc.Root.Elements().ToList();
        elements.ForEach(Console.WriteLine);
    }
}
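
If you need this in the shape of the method from the question (a string in, a List<XElement> out), the same steps can be wrapped up roughly as follows. This is only a sketch: the class name CDataHelper and the method name ExtractElements are made up for illustration, and it assumes the input is well-formed XML once wrapped in a root element. Note that the elements keep their original name (Text), since nothing here renames them.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class CDataHelper
{
    // Hypothetical helper: parses an XML fragment and returns its top-level
    // elements with every CDATA section replaced by a plain text node.
    public static List<XElement> ExtractElements(string input)
    {
        // A fragment with several top-level elements is not a valid document,
        // so wrap it in a synthetic root before parsing.
        var doc = XDocument.Parse("<root>" + input + "</root>");

        // Materialize the CDATA nodes first (ToList) so the tree can be
        // modified safely while looping.
        foreach (var cdata in doc.DescendantNodes().OfType<XCData>().ToList())
        {
            cdata.ReplaceWith(new XText(cdata.Value));
        }

        return doc.Root.Elements().ToList();
    }
}
```

Calling CDataHelper.ExtractElements with the input from the question returns two elements, which print as <Text>Hello</Text> and <Text>World</Text>.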

I would use XDocument instead of Regex:

var value = "<root><Text><![CDATA[Hello]]></Text><Text><![CDATA[World]]></Text></root>";
var doc = XDocument.Parse(value);
Console.WriteLine (doc.Root.Elements().ElementAt(0).Value);
Console.WriteLine (doc.Root.Elements().ElementAt(1).Value);

Output:

Hello
World
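
Since the question asks for a list of <text> elements rather than two printed values, the Value approach can be extended along these lines. Treat it as a sketch under assumptions: the names TextValues and ToTextElements are invented for this example, and the input is wrapped in a root element inside the method rather than by the caller. XElement.Value already returns the text of a CDATA section without the markers, so no explicit replacement step is needed.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class TextValues
{
    // Hypothetical helper: builds a new lowercase <text> element for each
    // top-level element of the fragment, carrying over only its text content.
    public static List<XElement> ToTextElements(string input)
    {
        var doc = XDocument.Parse("<root>" + input + "</root>");

        return doc.Root.Elements()
                  // e.Value is the concatenated text content, CDATA included.
                  .Select(e => new XElement("text", e.Value))
                  .ToList();
    }
}
```

With the input from the question, the returned list holds <text>Hello</text> and <text>World</text>.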

