C#: remove (NULL) from XML tags

I need to figure out a good way, using C#, to scan an XML file for (NULL) in the tag names and replace it with the word BAD.

For example:

<GC5_(NULL) DIRTY="False"></GC5_(NULL)>

should be replaced with

<GC5_BAD DIRTY="False"></GC5_BAD>

Part of the problem is that I have no control over the original XML; I just need to fix it once I receive it. The second problem is that (NULL) can appear in zero, one, or many tags. It seems to depend on whether users fill in additional fields. So I might get

<GC5_(NULL) DIRTY="False"></GC5_(NULL)>

or

<MH_OTHSECTION_TXT_(NULL) DIRTY="False"></MH_OTHSECTION_TXT_(NULL)>

or

<LCDATA_(NULL) DIRTY="False"></LCDATA_(NULL)>

I am a newbie to C# and programming.

EDIT: I have come up with the following function which, while not pretty, has worked so far.

// Requires: using System.Text; using System.Text.RegularExpressions;
public static string CleanInvalidXmlChars(string fileText)
{
    // Control characters that are not legal in XML 1.0.
    char[] charsToRemove = { (char)0x19, (char)0x1C, (char)0x1D };
    foreach (char c in charsToRemove)
        fileText = fileText.Replace(c.ToString(), string.Empty);

    StringBuilder b = new StringBuilder(fileText);
    // Character references to control characters.
    b.Replace("&#x0;", string.Empty);
    b.Replace("&#x1C;", string.Empty);
    // Lowercase (null) at the start or end of a tag.
    b.Replace("<(null)", "<BAD");
    b.Replace("(null)>", "BAD>");

    // Tags whose name contains _(NULL) followed by more text, such as attributes.
    Regex nullMatch = new Regex("<(.+?)_\\(NULL\\)(.+?)>");
    string result = nullMatch.Replace(b.ToString(), "<$1_BAD$2>");

    // Catch-all for any (NULL) the regex did not match, e.g. in closing tags.
    result = result.Replace("(NULL)", "BAD");

    return result;
}

I have only been able to find 6 or 7 bad XML files to test this code on, but it has worked on each of them and not removed good data. I appreciate the feedback and your time.

In general, regular expressions are not the right way of handling XML files. There's a range of solutions to handle XML files correctly - you can read up on System.Xml.Linq for a good start. If you're a newbie, it's certainly something you should learn at some point. As Ed Plunkett pointed out in the comments, though, your XML is not actually XML: ( and ) characters are not allowed in XML element names.
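To make the point about element names concrete, here is a minimal sketch that tries to load one of the sample tags with XDocument.Parse from System.Xml.Linq. The class and method names (XmlNameCheck, TryParse) are made up for illustration. It assumes the parser rejects the raw tag with an XmlException, because ( is not a valid name character, and accepts it once (NULL) has been replaced.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlNameCheck
{
    // Returns true if the text loads as XML, false if the parser rejects it.
    public static bool TryParse(string xml)
    {
        try
        {
            XDocument.Parse(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        string raw = "<GC5_(NULL) DIRTY=\"False\"></GC5_(NULL)>";
        string repaired = raw.Replace("(NULL)", "BAD");

        Console.WriteLine(TryParse(raw));       // tag name still contains '(' and ')'
        Console.WriteLine(TryParse(repaired));  // tag name is now GC5_BAD
    }
}
```

A check like this is also a cheap way to confirm that whatever string fix you choose actually produced something an XML parser will accept.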

Since you will have to do it as an operation on a string, Corak's comment to use

contentOfXml.Replace("(NULL)", "BAD");

may be a good idea, but will break if any elements can contain the string (NULL) as anything other than their name.
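A small sketch of that failure mode, assuming a hypothetical NOTE element whose text happens to contain (NULL). Neither NOTE nor the class name NaiveReplaceDemo comes from the question; they are only there to show the effect.

```csharp
using System;

public static class NaiveReplaceDemo
{
    // Replaces every occurrence of (NULL), whether it is in a tag name or not.
    public static string ReplaceEverywhere(string xml)
    {
        return xml.Replace("(NULL)", "BAD");
    }

    public static void Main()
    {
        // NOTE is an invented element used only to illustrate the problem.
        string input =
            "<GC5_(NULL) DIRTY=\"False\"></GC5_(NULL)><NOTE>value was (NULL)</NOTE>";

        Console.WriteLine(ReplaceEverywhere(input));
        // Prints: <GC5_BAD DIRTY="False"></GC5_BAD><NOTE>value was BAD</NOTE>
    }
}
```

The tag names are repaired, but the text inside NOTE has been changed as well. Whether that matters depends on whether your data can ever contain (NULL) as a real value.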

If you want a regex approach, something like this should work reasonably well, though I can't promise it covers every edge case. It allows underscores elsewhere in the element name, as in MH_OTHSECTION_TXT_(NULL):

var regex = new Regex(@"(<\/?[^<>\s]*_)\(NULL\)([^>]*>)");
var result = regex.Replace(contentOfXml, "$1BAD$2");
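As a sketch of a replacement that only touches tag names, the following runs one pattern over the three sample tags from the question. The class name TagNameFixer and the NOTE element are hypothetical. The pattern assumes (NULL) always follows an underscore, that it appears at most once per tag name, and that the name contains no whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagNameFixer
{
    // "<" or "</", then name characters up to the last "_" that is directly
    // followed by "(NULL)", then the rest of the tag through ">".
    private static readonly Regex NullInTagName =
        new Regex(@"(</?[^<>\s]*_)\(NULL\)([^>]*>)");

    public static string Fix(string xml)
    {
        return NullInTagName.Replace(xml, "$1BAD$2");
    }

    public static void Main()
    {
        string[] samples =
        {
            "<GC5_(NULL) DIRTY=\"False\"></GC5_(NULL)>",
            "<MH_OTHSECTION_TXT_(NULL) DIRTY=\"False\"></MH_OTHSECTION_TXT_(NULL)>",
            "<LCDATA_(NULL) DIRTY=\"False\"></LCDATA_(NULL)>",
            // Invented element: the text content should be left alone.
            "<NOTE>value was (NULL)</NOTE>"
        };

        foreach (string sample in samples)
            Console.WriteLine(Fix(sample));
    }
}
```

Because the first group cannot cross a < or > character, a (NULL) that sits in element text is not matched, so the last sample is printed unchanged.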

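Would it work for you to read the XML as a string and perform a regex replacement? For example:

Regex nullMatch = new Regex("<(.+?)_\\(NULL\\)(.*?)>");
String processedXmlString = nullMatch.Replace(originalXmlString, "<$1_BAD$2>");

The second group uses .*? rather than .+? so that closing tags such as </GC5_(NULL)>, where nothing follows (NULL) before the >, are matched as well.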
