Parse string in C# extension method with regex

I need to create an extension method that parses (splits) my string.

For example, if I have the string

COMMAND 1 PROCESSED "JOB command" 20160801 09:05:24

it should be split like this:

COMMAND

1

PROCESSED

"JOB command"

20160801

09:05:24

Another example. If I have the string:

COMMAND 2 ERROR 06 00000032 "Message window is still active." 20160801 09:05:24

It should be split like this:

COMMAND

2

ERROR

06

00000032

"Message window is still active."

20160801 09:05:24

I have a solution for this, but I am sure there is a much cleaner one.

My solution:

public static List<string> GetTokens(this string line)
{
    // TODO: Code refactoring:
    var res = new List<string>();
    var parts = Regex.Split(line, "/[^\\s\"']+|\"([^\"]*)\"|'([^']*)'/g");

    var subParts = parts[0].Split(' ');
    foreach (var val in subParts)
    {
        res.Add(val);
    }
    res.Add(parts[1]);
    subParts = parts[2].Split(' ');
    foreach (var val in subParts)
    {
        res.Add(val);
    }

    res.RemoveAll(f => f.Trim() == "");
    return res;
}

I would like to implement a cleaner solution. Any ideas?

I suggest implementing a simple loop instead of a complex regular expression:

public static IEnumerable<String> GetTokens(this string value) {
  if (string.IsNullOrEmpty(value))
    yield break; // or throw exception in case of value == null

  bool inQuotation = false;
  int index = 0;

  for (int i = 0; i < value.Length; ++i) {
    char ch = value[i];

    if (ch == '"')
      inQuotation = !inQuotation;
    else if ((ch == ' ') && (!inQuotation)) {
      yield return value.Substring(index, i - index);

      index = i + 1;
    }
  }

  if (index < value.Length)
    yield return value.Substring(index, value.Length - index);
}

Test

var source = 
  "COMMAND 2 ERROR 06 00000032 \"Message window is still active.\" 20160801 09:05:24";

Console.Write(string.Join(Environment.NewLine, GetTokens(source)));

Output

 COMMAND
 2
 ERROR
 06
 00000032
 "Message window is still active."
 20160801
 09:05:24

Edit: in case you want two quotation types, " (double) as well as ' (single):

public static IEnumerable<String> GetTokens(string value) {
  if (string.IsNullOrEmpty(value))
    yield break;

  bool inQuotation = false;
  bool inApostroph = false;

  int index = 0;

  for (int i = 0; i < value.Length; ++i) {
    char ch = value[i];

    if (inQuotation) 
      inQuotation = ch != '"';
    else if (inApostroph) 
      inApostroph = ch != '\'';
    else if (ch == '"')
      inQuotation = true;
    else if (ch == '\'')
      inApostroph = true;
    else if ((ch == ' ') && (!inQuotation)) {
      yield return value.Substring(index, i - index);

      index = i + 1;
    }
  }

  if (index < value.Length)
    yield return value.Substring(index, value.Length - index);
}

After a while I figured out some simple code:

public static List<string> GetTokens(this string line)
{
    return Regex.Matches(line, @"([^\s""]+|""([^""]*)"")").OfType<Match>().Select(l => l.Groups[1].Value).ToList();
}

I tested the code with a MessageBox that showed the list with | between each item.

You can use a regex like ([^\s"]+|"[^"]*") with the global flag.

Demo and Explanation
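
The answer above gives only the pattern, so here is a minimal sketch of how it might be applied in C#. It assumes .NET's Regex.Matches, which returns every match in the input and therefore plays the role of the global flag. The class name QuotedTokenizer is hypothetical and not part of the original answer.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedTokenizer // hypothetical name, not from the answer
{
    // Either a run of characters that are neither whitespace nor a double quote,
    // or a complete double-quoted string (the quotes are part of the match).
    private static readonly Regex TokenPattern = new Regex(@"[^\s""]+|""[^""]*""");

    public static List<string> GetTokens(this string line)
    {
        // Regex.Matches already yields all matches, so no "global" option is needed.
        return TokenPattern.Matches(line)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```

Called on the first sample line from the question, this should give six tokens, with "JOB command" kept as a single token including its quotes.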

A pure regex solution: 纯正则表达式解决方案:

public static List<string> GetTokens(this string line)
{
    return Regex.Matches(line,
        @""".*?""|\S+").Cast<Match>().Select(m => m.Value).ToList();
}

The ".*?"|\S+ regex matches either a quoted string or a sequence of non-whitespace characters. All matches can then be returned as a collection in one go.

Here is a demo: https://ideone.com/hmLQIt
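
One detail worth illustrating is that the order of the two alternatives matters: .NET tries alternatives left to right, so the quoted-string branch has to come first. The sketch below compares both orderings; the class and method names (AlternationOrderDemo, QuotedFirst, NonSpaceFirst) are hypothetical and exist only for this illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AlternationOrderDemo // hypothetical name, for illustration only
{
    // Collects the text of every match of the given pattern.
    private static List<string> Tokenize(string line, string pattern) =>
        Regex.Matches(line, pattern).Cast<Match>().Select(m => m.Value).ToList();

    // Pattern from the answer: the quoted branch is tried before \S+.
    public static List<string> QuotedFirst(string line) =>
        Tokenize(line, @""".*?""|\S+");

    // Swapped order: \S+ wins at the opening quote, so the quoted string is cut at spaces.
    public static List<string> NonSpaceFirst(string line) =>
        Tokenize(line, @"\S+|"".*?""");
}
```

For the first sample line from the question, QuotedFirst keeps "JOB command" together and returns six tokens, while NonSpaceFirst splits it into "JOB and command" and returns seven.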
