
Extracting and Manipulating Strings in C#.Net

We have a requirement to extract and manipulate strings in C#.NET. We have the following string:

($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:("test@test.com"))

We need to extract the strings between the $ characters.

Therefore, in the end, we need to get a list of strings containing - name, phonenumber, emailaddress.

What would be the ideal way to do it? Are there any out-of-the-box features available for this?

Regards,

John

The simplest way is to use a regular expression to match all word characters between two $ characters:

var regex=new Regex(@"\$\w+\$");
var input = "($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:(\"test@test.com\"))";

var matches=regex.Matches(input);

This will return a collection of matches. The .Value property of each match contains the matching string. \$ is used because $ has special meaning in regular expressions: it matches the end of a string. \w means a word character (a letter, digit, or underscore). + means one or more.

Since this is a collection, you can use LINQ on it to get, e.g., an array with the values:

var values=matches.OfType<Match>().Select(m=>m.Value).ToArray();

That array will contain the values $name$, $phonenumer$, $emailaddress$.

Capture by name

You can specify groups in the pattern and attach names to them. For example, you can group the field name values:

var regex=new Regex(@"\$(?<name>\w+)\$");
var names=regex.Matches(input)
                .OfType<Match>()
                .Select(m=>m.Groups["name"].Value);

This will return name, phonenumer, emailaddress. Parentheses are used for grouping. (?<somename>pattern) is used to attach a name to the group.

Extract both names and values

You can also capture the field values and extract them as a separate field. Once you have the field name and value, you can return them, e.g. as an object or anonymous type.

The pattern in this case is more complex:

@"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)"

Parentheses are escaped because we want to match the literal parentheses around the values. Both ' and " characters are used in values, so ['"] is used to specify a choice of characters. The pattern is a verbatim string literal (i.e. it starts with @), so the double quotes have to be escaped by doubling them: ['""]. Any character has to be matched with .+ but only up to the next character in the pattern, hence the lazy form .+?. Without the ? the pattern .+ is greedy and would match as much as possible, up to the last quote followed by ) in the input.

Putting this together:

var regex = new Regex(@"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)");
var myValues = regex.Matches(input)
          .OfType<Match>()
          .Select(m=>new {  Name=m.Groups["name"].Value, 
                            Value=m.Groups["value"].Value
            })
          .ToArray();
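To see why the lazy quantifier matters, the following sketch runs the same pattern twice against the question's input, once with .+? and once with .+ . The variable names lazy and greedy are only illustrative, and the snippet assumes top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

var input = "($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:(\"test@test.com\"))";

// Same pattern, differing only in the quantifier inside the "value" group.
var lazy = new Regex(@"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)");
var greedy = new Regex(@"\$(?<name>\w+)\$:\(['""](?<value>.+)['""]\)");

var lazyMatches = lazy.Matches(input);
var greedyMatches = greedy.Matches(input);

// The lazy pattern stops at the first closing quote followed by ')'.
Console.WriteLine(lazyMatches.Count);   // 3

// The greedy pattern runs to the last closing quote followed by ')',
// so all three fields collapse into a single match.
Console.WriteLine(greedyMatches.Count); // 1
Console.WriteLine(greedyMatches[0].Groups["value"].Value);
```

With the greedy version, the single captured value starts at George and ends at test@test.com, with the other two fields swallowed in between.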

Turn them into a dictionary

Instead of ToArray() you could convert the objects to a dictionary with ToDictionary(), e.g. with .ToDictionary(it=>it.Name,it=>it.Value). You could also omit the Select step and generate the dictionary from the matches themselves:

var myDict = regex.Matches(input)
          .OfType<Match>()
          .ToDictionary(m=>m.Groups["name"].Value, 
                        m=>m.Groups["value"].Value);
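One thing to keep in mind is that ToDictionary() throws an ArgumentException when the same key appears twice. If the input might repeat a field name, a plain loop with the dictionary indexer is one possible alternative. The sketch below assumes that "last value wins" is acceptable; the input with a repeated name field is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var regex = new Regex(@"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)");

// Hypothetical input in which the field "name" appears twice.
var input = "($name$:('George') AND $name$:('John'))";

var myDict = new Dictionary<string, string>();
foreach (Match m in regex.Matches(input))
{
    // The indexer overwrites an existing key instead of throwing.
    myDict[m.Groups["name"].Value] = m.Groups["value"].Value;
}

Console.WriteLine(myDict["name"]); // John
```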

Regular expressions are generally fast because they don't split the string. The pattern is converted to efficient code that parses the input and skips non-matching input immediately. Each match and group contains only the index of its starting and ending character in the input string. A string is only generated when .Value is called.

The Regex class is thread-safe, which means a single Regex object can be stored in a static field and reused from multiple threads. That helps in web applications, as there's no need to create a new Regex object for each request.

Because of these two advantages, regular expressions are used extensively to parse log files and extract specific fields. Compared to splitting, performance can be 10 times better or more, while memory usage remains low. Splitting can easily result in memory usage that's multiple times bigger than the original input file.
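The reuse of a single Regex object described above can be sketched as follows. One instance is created once and then called from many threads through Parallel.For. In a real application the instance would typically live in a static readonly field; a local variable is used here only to keep the snippet short. RegexOptions.Compiled is an optional choice that trades startup time for faster matching, and the name fieldRegex and the iteration count of 100 are illustrative.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Created once and shared; Regex instances are immutable and thread-safe.
var fieldRegex = new Regex(@"\$(?<name>\w+)\$", RegexOptions.Compiled);

var input = "($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:(\"test@test.com\"))";

// Collect the number of matches found by each parallel iteration.
var counts = new ConcurrentBag<int>();
Parallel.For(0, 100, _ =>
{
    counts.Add(fieldRegex.Matches(input).Count);
});

// Every iteration should find the same three field names.
Console.WriteLine(counts.All(c => c == 3)); // True
```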

Can it go faster?

Yes. Regular expressions produce parsing code that may not be as efficient as possible. A hand-written parser could be faster. In this particular case, we want to start capturing text when a $ is detected and stop at the next $. This can be done with the following method:

IEnumerable<string> GetNames(string input)
{
    var builder=new StringBuilder(20);
    bool started=false;
    foreach(var c in input)
    {        
        if (started)
        {
            if (c!='$')
            {
                builder.Append(c);
            }
            else
            {
                started=false;
                var value=builder.ToString();
                yield return value;
                builder.Clear();
            }
        }
        else if (c=='$')
        {
            started=true;
        }        
    }
}

A string is an IEnumerable<char> so we can inspect one character at a time without having to copy them. By using a single StringBuilder with a predetermined capacity we avoid reallocations, at least until we find a key that's longer than 20 characters.

Modifying this code to extract values though isn't so easy.
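As a rough illustration of what that extra work looks like, here is one possible sketch that uses IndexOf instead of a character loop. It is not equivalent to the regular expression: it assumes every value is wrapped as :('...') or :("..."), that the closing quote is the same character as the opening one, and that values contain no embedded quote of that kind. GetPairs and its local variable names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var input = "($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:(\"test@test.com\"))";

var pairs = GetPairs(input).ToList();
foreach (var pair in pairs)
{
    Console.WriteLine($"{pair.Key} = {pair.Value}");
}

IEnumerable<KeyValuePair<string, string>> GetPairs(string text)
{
    int pos = 0;
    while (pos < text.Length)
    {
        // Locate the $...$ pair that delimits the field name.
        int nameStart = text.IndexOf('$', pos);
        if (nameStart < 0) yield break;
        int nameEnd = text.IndexOf('$', nameStart + 1);
        if (nameEnd < 0) yield break;

        // The name must be followed by ":(" and an opening quote.
        int quotePos = nameEnd + 3;
        bool hasPrefix = quotePos < text.Length
            && text[nameEnd + 1] == ':'
            && text[nameEnd + 2] == '('
            && (text[quotePos] == '\'' || text[quotePos] == '"');
        if (!hasPrefix)
        {
            // Not a field; retry, treating the second $ as a possible start.
            pos = nameEnd;
            continue;
        }

        // The value runs up to the next occurrence of the same quote character.
        int valueEnd = text.IndexOf(text[quotePos], quotePos + 1);
        if (valueEnd < 0) yield break;

        yield return new KeyValuePair<string, string>(
            text.Substring(nameStart + 1, nameEnd - nameStart - 1),
            text.Substring(quotePos + 1, valueEnd - quotePos - 1));

        pos = valueEnd + 1;
    }
}
```

Even this small version needs explicit bounds checks and a decision about malformed input, which is the kind of detail the regular expression handles implicitly.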

Here's one way to do it, though it's certainly not very elegant. Basically, splitting the string on the '$' character and taking every other item will give you the result (after some additional trimming of unwanted characters).

In this example, I'm also grabbing the value of each item and then putting both in a dictionary:

var input = "($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:(\"test@test.com\"))";
var inputParts = input.Replace(" AND ", "")
    .Trim(')', '(')
    .Split(new[] {'$'}, StringSplitOptions.RemoveEmptyEntries);

var keyValuePairs = new Dictionary<string, string>();

for (int i = 0; i < inputParts.Length - 1; i += 2)
{
    var key = inputParts[i];
    var value = inputParts[i + 1].Trim('(', ':', ')', '"', '\'', ' ');

    keyValuePairs[key] = value;
}

foreach (var kvp in keyValuePairs)
{
    Console.WriteLine($"{kvp.Key} = {kvp.Value}");
}

// Wait for input before closing
Console.WriteLine("\nDone!\nPress any key to exit...");
Console.ReadKey();

Output

name = George
phonenumer = 456456
emailaddress = test@test.com

Done!
Press any key to exit...
