
Removing Invalid Characters from XML Name Tag - RegEx C#

I have a string with xml data that I pulled from a web service. The data is ugly and has some invalid chars in the Name tags of the xml. For example, I may see something like:

<Author>Scott the Coder</Author><Address#>My address</Address#>

The # in the Address name field is invalid. I am looking for a regular expression that will remove all the invalid chars from the name tags BUT leave all the chars in the Value section of the xml. In other words, I want to use RegEx to remove chars only from the opening name tags and closing name tags. Everything else should remain the same.

I don't have all the invalid chars yet, but this will get me started: #{}&()

Is it possible to do what I am trying to do?

If you only need to check whether a name is valid for an XML node, take a look at the XmlConvert class, especially its VerifyName and VerifyNCName methods.

Also note that with that class you can accept any text as a node name by using the EncodeName and EncodeLocalName methods.

Using those methods will be easier, safer, and faster than writing a regular expression.
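As a rough illustration of the XmlConvert approach, here is a minimal sketch. The class name XmlNameCheck and the helper IsValidXmlName are made up for this example; only XmlConvert.VerifyName, EncodeName and DecodeName come from the framework.

```csharp
using System;
using System.Xml;

public static class XmlNameCheck
{
    // Hypothetical helper: true when 'name' is a legal XML element name.
    public static bool IsValidXmlName(string name)
    {
        try
        {
            XmlConvert.VerifyName(name);   // throws XmlException on an invalid name
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsValidXmlName("Address"));    // True
        Console.WriteLine(IsValidXmlName("Address#"));   // False

        // EncodeName escapes each invalid character as _xHHHH_ instead of dropping it.
        string encoded = XmlConvert.EncodeName("Address#");
        Console.WriteLine(encoded);                          // Address_x0023_
        Console.WriteLine(XmlConvert.DecodeName(encoded));   // Address#
    }
}
```

Note that these methods work on a single name, so you would still have to pull the names out of the raw string before you can check or encode them.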

You can use a string replacement to remove all invalid characters. Usually it is the ASCII control characters that cause problems when reading XML.

To avoid them, use this function:

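    // Requires: using System.Text.RegularExpressions;
    // Extension methods must be declared inside a static class.
    public static string CleanInvalidXmlChars(this string text)
    {
        // Valid chars per the XML 1.0 spec:
        // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        // i.e. any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
        // .NET's \x escape takes only two hex digits, so \u escapes are used here.
        // Strings are UTF-16, so code points above #xFFFF appear as surrogate pairs:
        // well-formed pairs are kept and only unpaired surrogates are removed.
        string re = @"[\u0000-\u0008\u000B\u000C\u000E-\u001F\uFFFE\uFFFF]"
                  + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"
                  + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
        return Regex.Replace(text, re, "");
    }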


     xmlcontent = xmlcontent.CleanInvalidXmlChars();

This removes the characters matched by the regular expression. I got it from this site.

I had a simple form with two text areas and one button. This seems to do the trick.

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Text.RegularExpressions;

namespace WindowsFormsApplication3
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            // Removes one invalid character that sits directly before the closing '>'
            // of an opening or closing tag, e.g. <Address#> becomes <Address>.
            // Invalid characters elsewhere in the name, or runs of them, are not matched.
            Regex r = new Regex(@"(?<=</?\w+)[#{}()&](?=>)");
            textBox2.Text = r.Replace(textBox1.Text, "");
        }
    }
}

RegEx is a problematic way to go unless you really only have one file to process. Pain, frustration, bugs is your future there...

If you really want to use a RegEx, there are useful ones HERE that I have used in Perl.

Have you considered using a parser instead?

Two to consider:

LINQ for XML

XmlDocument

Once parsed, you can re-save the troublesome sections or just go on your programmatic way.
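One thing to keep in mind with the parser route: a conforming parser rejects names like Address#, so the raw string has to be repaired before it will load. A minimal sketch with LINQ to XML follows. The ParseCheck class, the TryParse helper and the <root> wrapper element are assumptions made for this example.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class ParseCheck
{
    // Hypothetical helper: true if 'xml' parses as a single well-formed element.
    public static bool TryParse(string xml, out XElement element)
    {
        try
        {
            element = XElement.Parse(xml);
            return true;
        }
        catch (XmlException)
        {
            element = null;
            return false;
        }
    }

    public static void Main()
    {
        XElement parsed;

        // The '#' in the element name makes the document ill-formed.
        Console.WriteLine(TryParse("<root><Address#>My address</Address#></root>", out parsed)); // False

        // Once the name is repaired, the parser accepts it.
        if (TryParse("<root><Address>My address</Address></root>", out parsed))
        {
            Console.WriteLine(parsed.Element("Address").Value); // My address
        }
    }
}
```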

Try this:

s = Regex.Replace(s, @"[#{}&()]+(?=[^<>]*>)", "");

If the lookahead succeeds, the next angle bracket after the match is a right-pointing one ( > ), which indicates that the match occurred inside a tag.

Of course, this assumes the text is reasonably well-formed and that it contains no angle brackets aside from the ones in the tags.
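To see the lookahead at work, here is a small sketch that runs the expression over input that also has the "invalid" characters in a value. The TagNameCleaner class and the StripInvalidTagChars method are names invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagNameCleaner
{
    // Hypothetical wrapper around the expression above.
    public static string StripInvalidTagChars(string s)
    {
        return Regex.Replace(s, @"[#{}&()]+(?=[^<>]*>)", "");
    }

    public static void Main()
    {
        string input = "<Author>Scott the Coder</Author><Address#>Apt #5 (rear)</Address#>";
        Console.WriteLine(StripInvalidTagChars(input));
        // <Author>Scott the Coder</Author><Address>Apt #5 (rear)</Address>
    }
}
```

The # and the parentheses in the value survive because the next angle bracket after them is a <. The same rule means the listed characters are also stripped from attribute values, since those sit inside a tag as well.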
