Extract some values in formatted string

I would like to retrieve values from a string formatted like this:

public var any:int = 0;
public var anyId:Number = 2;
public var theEnd:Vector.<uint>;
public var test:Boolean = false;
public var others1:Vector.<int>;
public var firstValue:CustomType;
public var field2:Boolean = false;
public var secondValue:String = "";
public var isWorks:Boolean = false;

I want to store the field name, type and value in a custom class Property:

public class Property
{
    public string Name { get; set; }
    public string Type { get; set; }
    public string Value { get; set; }
}

And I want to get these values with a regular expression.

How can I do that?

Thanks

EDIT: I tried this, but I don't know how to go further with vectors, etc.

    /public var ([a-zA-Z0-9]*):([a-zA-Z0-9]*)( = \"?([a-zA-Z0-9]*)\"?)?;/g

OK, posting my regex-based answer.

Your regex, /public var ([a-zA-Z0-9]*):([a-zA-Z0-9]*)( = \"?([a-zA-Z0-9]*)\"?)?;/g, contains regex delimiters. They are not supported in C#, so they are treated as literal symbols. You need to remove them together with the g modifier: to obtain multiple matches in C#, use Regex.Matches, or Regex.Match with a while loop over Match.Success / .NextMatch().
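As a minimal sketch of the second option (Regex.Match with a while loop), here is the pattern from the question with the /.../g wrapper removed. The two-line sample string and the names list are assumptions made for illustration, and the snippet is written as top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Pattern from the question, without the leading '/' and the trailing '/g'.
var pattern = @"public var ([a-zA-Z0-9]*):([a-zA-Z0-9]*)( = ""?([a-zA-Z0-9]*)""?)?;";
var sample = "public var any:int = 0;\npublic var test:Boolean = false;";

var names = new List<string>();

// Regex.Match returns the first match; NextMatch() moves to the following one.
var m = Regex.Match(sample, pattern);
while (m.Success)
{
    names.Add(m.Groups[1].Value);
    Console.WriteLine(m.Groups[1].Value + " : " + m.Groups[2].Value + " = " + m.Groups[4].Value);
    m = m.NextMatch();
}
```

This pattern still cannot match types such as Vector.<uint>, which is what the regex further down addresses.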

The regex I am using is (?<=\s*var\s*)(?<name>[^=:\n]+):(?<type>[^;=\n]+)(?:=(?<value>[^;\n]+))?. The \n is included in the negated character classes because a negated character class can otherwise match a newline character.
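The point about newlines can be checked with a small sketch. The two-line input and the variable names loose and strict are made up for illustration; only the type part of the pattern is used.

```csharp
using System;
using System.Text.RegularExpressions;

// Two declarations without a terminating ';' on the first line (hypothetical input).
var text = "a:int\nb:Number";

// Without \n in the negated class, the match runs across the line break.
var loose = Regex.Match(text, @":(?<type>[^;=]+)").Groups["type"].Value;

// With \n excluded, the match stops at the end of the first line.
var strict = Regex.Match(text, @":(?<type>[^;=\n]+)").Groups["type"].Value;

Console.WriteLine(loose.Contains("\n")); // True
Console.WriteLine(strict);               // int
```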

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var str = "public var any:int = 0;\r\npublic var anyId:Number = 2;\r\npublic var theEnd:Vector.<uint>;\r\npublic var test:Boolean = false;\r\npublic var others1:Vector.<int>;\r\npublic var firstValue:CustomType;\r\npublic var field2:Boolean = false;\r\npublic var secondValue:String = \"\";\r\npublic var isWorks:Boolean = false;";
var rx = new Regex(@"(?<=\s*var\s*)(?<name>[^=:\n]+):(?<type>[^;=\n]+)(?:=(?<value>[^;\n]+))?");
var coll = rx.Matches(str);
var props = new List<Property>();
// Trim() drops the spaces the pattern captures around the name, ':' and '='
foreach (Match m in coll)
    props.Add(new Property(m.Groups["name"].Value.Trim(), m.Groups["type"].Value.Trim(), m.Groups["value"].Value.Trim()));
foreach (var item in props)
    Console.WriteLine("Name = " + item.Name + ", Type = " + item.Type + ", Value = " + item.Value);

Or with LINQ:

// requires: using System.Linq;
var props = rx.Matches(str)
          .OfType<Match>()
          .Select(m => 
               new Property(m.Groups["name"].Value.Trim(), 
                   m.Groups["type"].Value.Trim(), 
                   m.Groups["value"].Value.Trim()))
          .ToList();

And the class example:

public class Property
{
    public string Name { get; set; }
    public string Type { get; set; }
    public string Value { get; set; }
    public Property()
    {}
    public Property(string n, string t, string v)
    {
        this.Name = n;  
        this.Type = t;
        this.Value = v;
    }
}

NOTE ON PERFORMANCE:

The regex is not the quickest, but it certainly beats the one in the other answer. Here is a test performed at regexhero.net:

[Image: regexhero.net benchmark screenshot]

It seems that you don't need regular expressions in a simple case like the one you've provided:

  String text =
    @"public var any:int = 0;
      public var anyId:Number = 2;
      public var theEnd:Vector.<uint>;
      public var test:Boolean = false;
      public var others1:Vector.<int>;
      public var firstValue:CustomType;
      public var field2:Boolean = false;";

  List<Property> result = text
    .Split(new Char[] {'\r','\n'}, StringSplitOptions.RemoveEmptyEntries)
    .Select(line => {
       int varIndex = line.IndexOf("var") + "var".Length;
       int colonIndex = line.IndexOf(':', varIndex);
       int equalsIndex = line.IndexOf('=', colonIndex);
       // '=' can be absent; the type then runs up to the ';'
       int typeEnd = equalsIndex < 0 ? line.IndexOf(';', colonIndex) : equalsIndex;
       if (typeEnd < 0) typeEnd = line.Length;

       return new Property() {
         Name = line.Substring(varIndex, colonIndex - varIndex).Trim(),
         Type = line.Substring(colonIndex + 1, typeEnd - colonIndex - 1).Trim(),
         Value = equalsIndex < 0 ? "" : line.Substring(equalsIndex + 1).Trim(' ', ';')
       };
    })
    .ToList();

If the text can contain comments and other stuff, e.g.

  "public (*var is commented out*) var sample: int = 123;;;; // another comment"

you have to implement a parser.

You can use the following pattern:

\s*(?<vis>\w+?)\s+var\s+(?<name>\w+?)\s*:\s*(?<type>\S+?)(\s*=\s*(?<value>\S+?))?\s*;

to match each element in a line. Appending ? after a quantifier results in a non-greedy match, which makes the pattern a lot simpler: there is no need to negate all unwanted classes.
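To illustrate what the lazy quantifier changes, here is a small sketch that compares a greedy and a lazy version of the type/value part of the pattern. The input without spaces around = does not appear in the question; it is an assumed edge case, chosen because that is where the two versions differ.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical declaration with no spaces around '='.
var line = "public var a:int=0;";

// Greedy \S+ swallows "int=0" as the type and leaves the value group unmatched.
var greedy = Regex.Match(line, @":\s*(?<type>\S+)(\s*=\s*(?<value>\S+))?\s*;");

// Lazy \S+? stops as soon as the rest of the pattern can match.
var lazy = Regex.Match(line, @":\s*(?<type>\S+?)(\s*=\s*(?<value>\S+?))?\s*;");

var greedyType = greedy.Groups["type"].Value;
var lazyType = lazy.Groups["type"].Value;
var lazyValue = lazy.Groups["value"].Value;

Console.WriteLine(greedyType);                 // int=0
Console.WriteLine(lazyType + " / " + lazyValue); // int / 0
```

On the sample input from the question, where = is surrounded by spaces, both versions give the same groups, because \S cannot cross the space.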

Values are optional, so the value group is wrapped in another, optional group: (\s*=\s*(?<value>\S+?))?

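The example passes RegexOptions.Multiline, although that option only changes how ^ and $ behave and this pattern has no anchors. We don't have to worry about accidentally matching newlines inside a name, type or value, because \w and \S cannot match a newline and every match has to end at a ;.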

The C# 6 syntax in the following example isn't required, but multiline string literals and interpolated strings make for much cleaner code.

var input= @"public var any:int = 0;
            public var anyId:Number = 2;
            public var theEnd:Vector.<uint>;
            public var test:Boolean = false;
            public var others1:Vector.<int>;
            public var firstValue:CustomType;
            public var field2:Boolean = false;
            public var secondValue:String = """";
            public var isWorks:Boolean = false;";

var pattern= @"\s*(?<vis>\w+?)\s+var\s+(?<name>\w+?)\s*:\s*(?<type>\S+?)(\s*=\s*(?<value>\S+?))?\s*;";
var regex = new Regex(pattern, RegexOptions.Multiline);
var results=regex.Matches(input);
foreach (Match m in results)
{
    var g = m.Groups;
    Console.WriteLine($"{g["name"],-15} {g["type"],-10} {g["value"],-10}");
}

var properties = (from m in results.OfType<Match>()
                    let g = m.Groups
                    select new Property
                    {
                        Name = g["name"].Value,
                        Type = g["type"].Value,
                        Value = g["value"].Value
                    })
                    .ToList();

I would consider using a parser generator like ANTLR, though, if I had to parse more complex input or if there were multiple patterns to match. Learning how to write the grammar takes some time, but once you learn it, it's easy to create parsers that can match input that would require very complicated regular expressions. Whitespace management also becomes a lot easier.

In this case, the grammar could be something like:

property   : visibility var name COLON type (EQUALS value)? SEMICOLON;
visibility : ALPHA+;
var        : ALPHA ALPHA ALPHA;
name       : ALPHANUM+;
type       : (ALPHANUM|DOT|LEFT|RIGHT)+;
value      : ALPHANUM+
           | literal;
literal    : DOUBLE_QUOTE ALPHANUM* DOUBLE_QUOTE;

ALPHANUM   : ALPHA
           | DIGIT;
ALPHA      : [A-Za-z];
DIGIT      : [0-9];
...
WS         : [ \t\r\n]+ -> skip;

With a parser, adding e.g. comments would be as simple as adding comment before SEMICOLON in the property rule, plus a new comment rule that matches the pattern of a comment.
