Removing Invalid Characters from XML Name Tag - RegEx C#

I have a string of XML data that I pulled from a web service. The data is ugly and has some invalid characters in the name tags of the XML. For example, I may see something like:

<Author>Scott the Coder</Author><Address#>My address</Address#>

The # in the Address tag name is invalid. I am looking for a regular expression that removes all the invalid characters from the name tags but leaves every character in the value section of the XML alone. In other words, I want to use RegEx to remove characters only from the opening and closing name tags. Everything else should remain the same.

I don't have the full list of invalid characters yet, but this will get me started: #{}&()

Is it possible to do what I am trying to do?

If your intention is only to check the validity of a name for an XML node, I suggest you take a look at the XmlConvert class, especially the VerifyName and VerifyNCName methods.

Also note that with that class you can accept any text as a node name by using the EncodeName and EncodeLocalName methods.

Using those methods will be easier, safer, and faster than running a regular expression.
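As a rough illustration of the XmlConvert suggestion above, here is a minimal sketch. The class and method names (XmlNameChecks, IsValidXmlName, ToSafeLocalName) are made up for this example; only the XmlConvert calls come from the framework. Note that EncodeLocalName escapes an invalid character rather than deleting it, so it only helps if escaped names are acceptable to you.

```csharp
using System;
using System.Xml;

public static class XmlNameChecks
{
    // Hypothetical helper: true when 'name' passes XmlConvert.VerifyName.
    public static bool IsValidXmlName(string name)
    {
        try
        {
            XmlConvert.VerifyName(name);
            return true;
        }
        catch (XmlException)
        {
            // Thrown when the name contains a character that is not allowed.
            return false;
        }
        catch (ArgumentNullException)
        {
            // Thrown when the name is null or empty.
            return false;
        }
    }

    // Hypothetical helper: escapes invalid characters as _xHHHH_ instead of removing them.
    public static string ToSafeLocalName(string name)
    {
        return XmlConvert.EncodeLocalName(name);
    }
}
```

With this sketch, IsValidXmlName("Address#") is false, and ToSafeLocalName("Address#") gives "Address_x0023_", which XmlConvert.DecodeName turns back into "Address#".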

You can use a string replace to remove all invalid characters. Usually it is the ASCII control characters that cause problems when reading XML.

To avoid that, use this function:

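    // Extension method: declare it inside a static class and add
    // "using System.Text.RegularExpressions;" at the top of the file.
    public static string CleanInvalidXmlChars(this string text)
    {
        // From the XML 1.0 spec, valid chars are:
        // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        // i.e. any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
        // .NET strings are UTF-16, so [#x10000-#x10FFFF] shows up as surrogate pairs:
        // strip control chars, FFFE/FFFF, and surrogates that are not part of a pair.
        // (\x takes exactly two hex digits in .NET, so \u escapes are used above FF.)
        string re = @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]"
                  + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"
                  + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
        return Regex.Replace(text, re, "");
    }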


     xmlcontent = xmlcontent.CleanInvalidXmlChars();

This removes the characters specified in the regular expression. I got this from this site.

I had a simple form with two text boxes and one button. This seems to do the trick.

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Text.RegularExpressions;

namespace WindowsFormsApplication3
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            // Remove #, {, }, (, ) and & when they sit directly before the closing '>'
            // of an opening (<Name) or closing (</Name) tag. Characters in the middle
            // of a name, and everything outside the tags, are left untouched.
            Regex r = new Regex(@"(?<=</?\w+)[#{}()&]+(?=>)");
            textBox2.Text = r.Replace(textBox1.Text, "");
        }
    }
}

RegEx is a problematic way to go unless you really have only one file to process. Pain, frustration, and bugs are your future there...

If you really want to use a RegEx, there are useful ones here that I have used in Perl.

Have you considered using a parser instead?

Two to consider:

LINQ to XML

XmlDocument

Once parsed, you can re-save the troublesome sections or just go on your programmatic way.
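One thing worth checking before relying on a parser: LINQ to XML only loads well-formed input, so a tag such as Address# makes XDocument.Parse throw an XmlException. The sketch below is a small, hypothetical helper (XmlParseCheck and TryParse are names invented here) that reports whether a piece of text can be parsed at all, which can help decide whether the names have to be repaired first.

```csharp
using System.Xml;
using System.Xml.Linq;

public static class XmlParseCheck
{
    // Hypothetical helper: returns the parsed document,
    // or null when the text is not well-formed XML.
    public static XDocument TryParse(string xml)
    {
        try
        {
            return XDocument.Parse(xml);
        }
        catch (XmlException)
        {
            // Raised for malformed input, including invalid characters in names.
            return null;
        }
    }
}
```

For example, TryParse("<Address#>My address</Address#>") returns null, while TryParse("<Address>My address</Address>") returns a document whose root element is named Address.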

Try this:

s = Regex.Replace(s, @"[#{}&()]+(?=[^<>]*>)", "");

If the lookahead succeeds, the next angle bracket after the match is a right-pointing one ( > ), which indicates that the match occurred inside a tag.

Of course, this assumes the text is reasonably well-formed and that it contains no angle brackets aside from the ones in the tags.
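To make the lookahead behaviour concrete, here is a small sketch that wraps the same pattern in a helper. TagNameCleaner and Clean are hypothetical names for this example; the pattern is the one given above.

```csharp
using System.Text.RegularExpressions;

public static class TagNameCleaner
{
    // Matches runs of # { } & ( ) that are followed by '>' with no other
    // angle bracket in between, i.e. characters that sit inside a tag.
    private static readonly Regex InvalidInTag =
        new Regex(@"[#{}&()]+(?=[^<>]*>)");

    public static string Clean(string xml)
    {
        return InvalidInTag.Replace(xml, "");
    }
}
```

Calling Clean on "<Author>Scott the Coder</Author><Address#>My address #1</Address#>" returns "<Author>Scott the Coder</Author><Address>My address #1</Address>": the # in the value survives because the next angle bracket after it is a <. The caveat shows up with a value such as "a # b > c", where the stray > makes the pattern treat the # as part of a tag and remove it.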
