
Extract some values in formatted string

I would like to retrieve values from a string formatted like this:

public var any:int = 0;
public var anyId:Number = 2;
public var theEnd:Vector.<uint>;
public var test:Boolean = false;
public var others1:Vector.<int>;
public var firstValue:CustomType;
public var field2:Boolean = false;
public var secondValue:String = "";
public var isWorks:Boolean = false;

I want to store the field name, type and value in a custom class Property:

public class Property
{
    public string Name { get; set; }
    public string Type { get; set; }
    public string Value { get; set; }
}

And I want to get these values with a regular expression.

How can I do this?

Thanks

EDIT: I tried this, but I don't know how to go further with vectors, etc.

    /public var ([a-zA-Z0-9]*):([a-zA-Z0-9]*)( = \"?([a-zA-Z0-9]*)\"?)?;/g

Ok, posting my regex-based answer.

Your regex, /public var ([a-zA-Z0-9]*):([a-zA-Z0-9]*)( = \"?([a-zA-Z0-9]*)\"?)?;/g, contains regex delimiters. C# does not support them, so they are treated as literal characters. Remove the delimiters and the g modifier: to obtain multiple matches in C#, use Regex.Matches, or Regex.Match in a while loop with Match.Success / .NextMatch().
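To illustrate the point about delimiters, here is a minimal sketch of the asker's pattern with the leading / and trailing /g removed, iterated with Match.Success / NextMatch(). The class name DelimiterDemo and the method FindNames are made up for this example; they are not part of the question or the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DelimiterDemo
{
    // The asker's pattern without the "/" delimiters and the "g" modifier.
    // Inside a verbatim string a double quote is written as "".
    public const string Pattern =
        @"public var ([a-zA-Z0-9]*):([a-zA-Z0-9]*)( = ""?([a-zA-Z0-9]*)""?)?;";

    // Hypothetical helper: collects the field name (group 1) of every match.
    public static List<string> FindNames(string input)
    {
        var names = new List<string>();
        // Regex.Match returns the first match; NextMatch() walks the rest,
        // which replaces the "g" modifier of other regex flavours.
        for (Match m = Regex.Match(input, Pattern); m.Success; m = m.NextMatch())
            names.Add(m.Groups[1].Value);
        return names;
    }
}
```

With this pattern the declarations typed Vector.<...> are still skipped, because [a-zA-Z0-9]* cannot match '.', '<' or '>'. That is the limitation the asker ran into.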

The regex I am using is (?<=\s*var\s*)(?<name>[^=:\n]+):(?<type>[^;=\n]+)(?:=(?<value>[^;\n]+))? . The \n is listed inside the negated character classes because a negated character class would otherwise match a newline character too.

var str = "public var any:int = 0;\r\npublic var anyId:Number = 2;\r\npublic var theEnd:Vector.<uint>;\r\npublic var test:Boolean = false;\r\npublic var others1:Vector.<int>;\r\npublic var firstValue:CustomType;\r\npublic var field2:Boolean = false;\r\npublic var secondValue:String = \"\";\r\npublic var isWorks:Boolean = false;";
var rx = new Regex(@"(?<=\s*var\s*)(?<name>[^=:\n]+):(?<type>[^;=\n]+)(?:=(?<value>[^;\n]+))?");
var coll = rx.Matches(str);
var props = new List<Property>();
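// Trim() is needed: the character classes also capture the spaces around the name, ':' and '='
foreach (Match m in coll)
    props.Add(new Property(m.Groups["name"].Value.Trim(), m.Groups["type"].Value.Trim(), m.Groups["value"].Value.Trim()));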
foreach (var item in props)
    Console.WriteLine("Name = " + item.Name + ", Type = " + item.Type + ", Value = " + item.Value);

Or with LINQ:

var props = rx.Matches(str)
          .OfType<Match>()
          .Select(m => 
               new Property(m.Groups["name"].Value.Trim(), 
                   m.Groups["type"].Value.Trim(), 
                   m.Groups["value"].Value.Trim()))
          .ToList();

And the class example:

public class Property
{
    public string Name { get; set; }
    public string Type { get; set; }
    public string Value { get; set; }
    public Property()
    {}
    public Property(string n, string t, string v)
    {
        this.Name = n;  
        this.Type = t;
        this.Value = v;
    }
}

NOTE ON PERFORMANCE :

The regex is not the quickest, but it certainly beats the one in the other answer. Here is a test performed at regexhero.net:

[image: benchmark results from regexhero.net]

It seems that you don't need regular expressions in a case as simple as the one you've provided:

  String text =
    @"public var any:int = 0;
      public var anyId:Number = 2;
      public var theEnd:Vector.<uint>;
      public var test:Boolean = false;
      public var others1:Vector.<int>;
      public var firstValue:CustomType;
      public var field2:Boolean = false;";

  List<Property> result = text
    .Split(new Char[] {'\r','\n'}, StringSplitOptions.RemoveEmptyEntries)
    .Select(line => {
       int varIndex = line.IndexOf("var") + "var".Length;
       int colonIndex = line.IndexOf(":");
       int equalsIndex = line.IndexOf("=");
       // '=' can be absent; the type then runs up to the terminating ';'
       int typeEnd = equalsIndex < 0 ? line.IndexOf(";") : equalsIndex;

       return new Property() {
         Name = line.Substring(varIndex, colonIndex - varIndex).Trim(),
         Type = line.Substring(colonIndex + 1, typeEnd - colonIndex - 1).Trim(),
         Value = equalsIndex < 0 ? "" : line.Substring(equalsIndex + 1).Trim(' ', ';')
       };
    })
    .ToList();

If the text can contain comments and other stuff, e.g.

  "public (*var is commented out*) var sample: int = 123;;;; // another comment"

you have to implement a parser.
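As a rough idea of what the first step of such a parser could look like, here is a minimal sketch that strips the two comment styles shown in the example line, (* ... *) and // ..., before any further processing. CommentStripper and Strip are hypothetical names, and the sketch assumes that comment markers never appear inside string literals. A real parser would have to track those as well.

```csharp
using System;
using System.Text;

public static class CommentStripper
{
    // Removes "(* ... *)" block comments and "// ..." line comments from one line.
    // Assumption: the markers never occur inside a string literal.
    public static string Strip(string line)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < line.Length)
        {
            bool hasNext = i + 1 < line.Length;
            if (hasNext && line[i] == '(' && line[i + 1] == '*')
            {
                int end = line.IndexOf("*)", i + 2, StringComparison.Ordinal);
                if (end < 0) break;   // unterminated comment: drop the rest of the line
                i = end + 2;          // continue right after the closing "*)"
            }
            else if (hasNext && line[i] == '/' && line[i + 1] == '/')
            {
                break;                // a line comment runs to the end of the line
            }
            else
            {
                sb.Append(line[i]);
                i++;
            }
        }
        return sb.ToString();
    }
}
```

The stripped line can then be passed to the Split/Substring code above. The repeated ';' characters in the example are still removed by Trim(' ', ';').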

You can use the following pattern:

\s*(?<vis>\w+?)\s+var\s+(?<name>\w+?)\s*:\s*(?<type>\S+?)(\s*=\s*(?<value>\S+?))?\s*;

to match each element in a line. Appending ? after a quantifier results in a non-greedy match which makes the pattern a lot simpler - no need to negate all unwanted classes.
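The difference between a greedy and a non-greedy quantifier can be seen in a small sketch. LazyDemo and its two methods are names invented for this illustration, and the patterns are deliberately simpler than the one above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyDemo
{
    // Greedy: \S+ grabs as much as possible and only backs off to the LAST ';'.
    public static string Greedy(string s) =>
        Regex.Match(s, @":(\S+);").Groups[1].Value;

    // Non-greedy: \S+? grows one character at a time and stops at the FIRST ';'.
    public static string Lazy(string s) =>
        Regex.Match(s, @":(\S+?);").Groups[1].Value;
}
```

For the input "a:int;b:uint;" the greedy version captures "int;b:uint", while the non-greedy one captures just "int".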

Values are optional, so the value group is wrapped in another, optional group: (\s*=\s*(?<value>\S+?))?

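RegexOptions.Multiline only changes how ^ and $ behave, and this pattern uses neither, so the option is not strictly needed here. Line breaks between declarations are consumed by the leading \s*, while \w and \S never match a newline, so a captured group cannot run across lines.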

The C# 6 string interpolation in the following example isn't required, but together with a verbatim (multiline) string literal it makes for much cleaner code.

var input= @"public var any:int = 0;
            public var anyId:Number = 2;
            public var theEnd:Vector.<uint>;
            public var test:Boolean = false;
            public var others1:Vector.<int>;
            public var firstValue:CustomType;
            public var field2:Boolean = false;
            public var secondValue:String = """";
            public var isWorks:Boolean = false;";

var pattern= @"\s*(?<vis>\w+?)\s+var\s+(?<name>\w+?)\s*:\s*(?<type>\S+?)(\s*=\s*(?<value>\S+?))?\s*;";
var regex = new Regex(pattern, RegexOptions.Multiline);
var results=regex.Matches(input);
foreach (Match m in results)
{
    var g = m.Groups;
    Console.WriteLine($"{g["name"],-15} {g["type"],-10} {g["value"],-10}");
}

var properties = (from m in results.OfType<Match>()
                    let g = m.Groups
                    select new Property
                    {
                        Name = g["name"].Value,
                        Type = g["type"].Value,
                        Value = g["value"].Value
                    })
                    .ToList();

I would consider using a parser generator like ANTLR though, if I had to parse more complex input or if there were multiple patterns to match. Learning how to write the grammar takes some time, but once you learn it, it's easy to create parsers that can match input that would require very complicated regular expressions. Whitespace management also becomes a lot easier.

In this case, the grammar could be something like:

property   : visibility var name COLON type (EQUALS value)? SEMICOLON;
visibility : ALPHA+;
var        : ALPHA ALPHA ALPHA;
name       : ALPHANUM+;
type       : (ALPHANUM|DOT|LEFT|RIGHT)+;
value      : ALPHANUM+
           | literal;
literal    : DOUBLE_QUOTE ALPHANUM* DOUBLE_QUOTE;

ALPHANUM   : ALPHA
           | DIGIT;
ALPHA      : [A-Za-z];
DIGIT      : [0-9];
...
WS         : [ \t\r\n]+ -> skip;

With a parser, adding e.g. comments would be as simple as adding a comment element before SEMICOLON in the property rule, plus a new comment rule that matches the pattern of a comment.
