
Regular expression to replace whitespace that is not inside tags

I have plain text with some custom tags. For example:

I like C#. <code lang="C#">public static void main</code>
THis is good language.

I need to replace all whitespace that is not inside a tag with &nbsp;.

After the replacement, the text must be:

I&nbsp;like&nbsp;C#.&nbsp;<code lang="C#">public static void main</code>
THis&nbsp;is&nbsp;good&nbsp;language.

If your text consists of valid XML elements mixed with plain text, you can use an XML parsing class such as XDocument. For instance:

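        // Requires: using System.Text; using System.Xml; using System.Xml.Linq;
        string input = @"I like C#. <code lang=""C#"">public static void main</code>THis is good language.";
        // Wrap the fragment in a root element so that it parses as a single XML document.
        string rootedInput = String.Format("<root>{0}</root>", input);

        XDocument doc = XDocument.Parse(rootedInput);
        // Nodes() returns only the direct children of the root. DescendantNodes() would also
        // return the text inside <code>, which would then be appended a second time.
        var nodes = doc.Root.Nodes();

        StringBuilder sb = new StringBuilder();
        foreach (XNode node in nodes)
        {
            // Only top-level text gets its spaces replaced; elements are copied unchanged.
            if (node.NodeType == XmlNodeType.Text)
                sb.Append(node.ToString().Replace(" ", "&nbsp;"));
            else
                sb.Append(node.ToString());
        }

        string newStr = sb.ToString();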

This will work if tags cannot contain other tags and there are no self-closing tags or other unusual constructs.

Using Perl notation:

s/ (?![^>]*<\/)/&nbsp;/g

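This also assumes that the files are well formed and that the opening and closing tags are on the same line (although you can easily change this into a multi-line regex). Note that a space inside an opening tag itself, such as the one before lang in the example, is not protected by this lookahead and is replaced as well.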

Here's how it works:

Because (as you indicated) tags cannot contain other tags, any text that you don't want to change is followed by a closing tag before the next opening tag, and every closing tag starts with </. Text that you do want to change, on the other hand, is followed by an opening tag (or by the end of the input) before the next closing tag.

So this just matches a space and then uses a negative lookahead to check whether a </ appears before the next > (that is, before the end of the next tag). If it does, the space is inside an element, so the match fails and the space isn't replaced.

This will only work if tags cannot contain other tags.
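
For readers working in C#, the same lookahead can be tried with Regex.Replace. The following is only a minimal sketch under the assumptions stated above (no nested tags, well-formed input). The class and method names are hypothetical, and the first lookahead is an addition that is not part of the Perl expression: it is meant to leave spaces inside an opening tag, such as the one before lang, untouched.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaheadSketch
{
    // Hypothetical helper name, not taken from the answer above.
    public static string ReplaceSpaces(string input)
    {
        // (?![^<>]*>)  added guard (an assumption): the space is not inside the
        //              angle brackets of a tag, e.g. between "code" and "lang".
        // (?![^>]*</)  the lookahead from the Perl expression: no closing tag
        //              starts before the next '>'.
        const string pattern = @" (?![^<>]*>)(?![^>]*</)";
        return Regex.Replace(input, pattern, "&nbsp;");
    }
}
```

Calling LookaheadSketch.ReplaceSpaces on the text from the question should leave the code element unchanged and replace only the spaces in the surrounding text.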

A simple idea! This works:

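// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
String ConvertString(String inputString)
{
    // Matches one <code ...>...</code> element that contains no nested tags.
    const string pattern = "(?<inTag><code[^>]*>[^<]*</code>)";
    var first = new List<string>();
    var second = new List<string>();

    // Remember every tagged section as it looks before the replacement.
    foreach (Match match in Regex.Matches(inputString, pattern))
    {
        first.Add(match.Groups["inTag"].Value);
    }

    inputString = inputString.Replace(" ", "&nbsp;");

    // Find the same sections again, now with their spaces replaced.
    foreach (Match match in Regex.Matches(inputString, pattern))
    {
        second.Add(match.Groups["inTag"].Value);
    }

    // Put the original sections back in place of the modified ones.
    for (int i = 0; i < first.Count; i++)
    {
        inputString = inputString.Replace(second[i], first[i]);
    }

    return inputString;
}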
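
The same protect-and-restore idea could also be done in a single pass. This is a different technique from the two-list approach above, offered only as a sketch: one pattern matches either a whole code element or a single space, and a match evaluator decides what to emit. The class and method names are hypothetical, and the sketch assumes that code elements are not nested.

```csharp
using System;
using System.Text.RegularExpressions;

static class SinglePassSketch
{
    // Hypothetical helper name, not taken from the answer above.
    public static string ReplaceOutsideCode(string input)
    {
        // First alternative: a whole <code ...>...</code> element without nested tags.
        // Second alternative: a single space.
        // The regex engine consumes a code element as one match, so the spaces
        // inside it are never matched on their own.
        const string pattern = @"<code\b[^>]*>[^<]*</code>| ";
        return Regex.Replace(input, pattern,
            m => m.Value == " " ? "&nbsp;" : m.Value);
    }
}
```

Because each code element is returned unchanged by the evaluator, nothing has to be restored afterwards.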
