Parse comma-separated string with a complication in C#

I know how to get substrings from a comma-separated string, but here's a complication: what if a substring itself contains a comma?

If a substring contains a comma, a new line, or double quotes, the entire substring is enclosed in double quotes.

If a substring contains a double quote, the double quote is escaped with another double quote. The worst-case scenario would be something like this:

first,"second, second","""third"" third","""fourth"", fourth"

In this case the substrings are:

  • first
  • second, second
  • "third" third
  • "fourth", fourth

second, second is enclosed in double quotes, and I don't want those double quotes in the list/array.

"third" third is enclosed in double quotes because it contains double quotes, and those are escaped with additional double quotes. Again, I don't want the enclosing double quotes in the list/array, and I don't want the double quotes that escape double quotes, but I do want the original double quotes that are part of the substring.

One way, using TextFieldParser:

using (var reader = new StringReader("first,\"second, second\",\"\"\"third\"\" third\",\"\"\"fourth\"\", fourth\""))    
using (var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader))
{
    parser.Delimiters = new[] { "," };
    parser.HasFieldsEnclosedInQuotes = true;

    while (!parser.EndOfData)
    {
        foreach (var field in parser.ReadFields())
            Console.WriteLine(field);
    }
}

Output:

first
second, second
"third" third
"fourth", fourth

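Try this:

string input = "first,\"second, second\",\"\"\"third\"\" third\",\"\"\"fourth\"\", fourth\"";
// Splits on the three-character sequence "," so it only separates fields whose neighbours are both quoted;
// the outer quotes and the doubled quotes are left in the results.
string[] output = input.Split(new string[] { "\",\"" }, StringSplitOptions.RemoveEmptyEntries);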

I would suggest constructing a small state machine for this problem. You would have states like:

  • Out - before the first field is reached
  • InQuoted - you were Out and " arrived; now you're in and the field is quoted
  • InQuotedMaybeOut - you were InQuoted and " arrived; now you wait for the next character to figure out whether it is another " or something else. If it is something else, select the next valid state (the character could be a space, new line, or comma, so you decide the next state); otherwise, if " arrived, you push " to the output and step back to InQuoted
  • In - after Out, when any character other than , and " has arrived, you are automatically inside a new field which is not quoted.

This will read CSV correctly. You can also make the separator configurable, so that you support TSV or semicolon-separated formats.

Also keep in mind one very important case in the CSV format: a quoted field may contain a new line! Another special case to keep an eye on: an empty field (like: ,,).
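The states described above could be sketched roughly as follows. This is only a minimal illustration, not a complete CSV reader: the class name CsvStateMachine, the method name ParseRecord and the lenient handling of malformed input (a stray character after a closing quote, or an unterminated quote) are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvStateMachine
{
    private enum State { Out, In, InQuoted, InQuotedMaybeOut }

    // Splits one record into fields. The separator is configurable,
    // so '\t' or ';' can be passed instead of the default ','.
    public static List<string> ParseRecord(string input, char separator = ',')
    {
        var fields = new List<string>();
        var field = new StringBuilder();
        var state = State.Out;

        foreach (char c in input)
        {
            switch (state)
            {
                case State.Out:
                    if (c == separator)
                    {
                        fields.Add(string.Empty);      // empty field, as in ",,"
                    }
                    else if (c == '"')
                    {
                        state = State.InQuoted;        // opening quote is not part of the field
                    }
                    else
                    {
                        field.Append(c);
                        state = State.In;
                    }
                    break;

                case State.In:
                    if (c == separator)
                    {
                        fields.Add(field.ToString());
                        field.Clear();
                        state = State.Out;
                    }
                    else
                    {
                        field.Append(c);
                    }
                    break;

                case State.InQuoted:
                    if (c == '"')
                    {
                        state = State.InQuotedMaybeOut; // closing quote or first half of ""
                    }
                    else
                    {
                        field.Append(c);                // separators and new lines are kept
                    }
                    break;

                case State.InQuotedMaybeOut:
                    if (c == '"')
                    {
                        field.Append('"');              // "" inside quotes is one literal quote
                        state = State.InQuoted;
                    }
                    else if (c == separator)
                    {
                        fields.Add(field.ToString());
                        field.Clear();
                        state = State.Out;
                    }
                    else
                    {
                        // Malformed input (assumption): keep the character and continue unquoted.
                        field.Append(c);
                        state = State.In;
                    }
                    break;
            }
        }

        // Emit the last field unless the input was empty.
        if (input.Length > 0)
        {
            fields.Add(field.ToString());
        }
        return fields;
    }
}
```

Calling CsvStateMachine.ParseRecord on the line from the question should give the four substrings listed there, and passing '\t' as the second argument would split a tab-separated line instead. Because the whole input is scanned character by character, a new line inside a quoted field simply becomes part of that field.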

This is not the most elegant solution, but it might help you. I would loop through the characters and keep an odd-even count of the quotes. For example, you have a bool that is true if you have encountered an odd number of quotes and false for an even number of quotes.

Any comma encountered while this bool is true should not be considered a separator. Once you know a comma is a separator, you can do several things with that information. Below I replace the delimiter with something more manageable (not very efficient, though):

bool odd = false;
char replacementDelimiter = '|'; // Or some very unlikely character

// Strings are immutable in C#, so work on a char array copy.
char[] chars = str.ToCharArray();

for (int i = 0; i < chars.Length; ++i)
{
   if (chars[i] == '"')
       odd = !odd;
   else if (chars[i] == ',')
   {
      if (!odd)
          chars[i] = replacementDelimiter;
   }
}

string[] commaSeparatedTokens = new string(chars).Split(replacementDelimiter);

At this point you should have an array of strings split on the commas you intended as separators. From here on it will be simpler to handle the quotes.

I hope this can help you.

Mini parser

using System;
using System.Collections.Generic;
using System.Text;

namespace ConsoleApp
{
    class Program
    {
        private static IEnumerable<string> Parse(string input)
        {
            if (string.IsNullOrWhiteSpace(input))
            {
                // empty string => nothing to do
                yield break;
            }

            int count = input.Length;
            StringBuilder sb = new StringBuilder();
            int j;

            for (int i = 0; i < count; i++)
            {
                char c = input[i];
                if (c == ',')
                {
                    yield return sb.ToString();
                    sb.Clear();
                }
                else if (c == '"')
                {
                    // begin quoted string
                    sb.Clear();
                    for (j = i + 1; j < count; j++)
                    {
                        if (input[j] == '"')
                        {
                            // quote
                            if (j < count - 1 && input[j + 1] == '"')
                            {
                                // double quote
                                sb.Append('"');
                                j++;
                            }
                            else
                            {
                                break;
                            }
                        }
                        else
                        {
                            sb.Append(input[j]);
                        }
                    }
                    yield return sb.ToString();

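                    // clear buffer and skip to next comma
                    sb.Clear();
                    for (i = j + 1; i < count && input[i] != ','; i++) ;
                    if (i >= count)
                    {
                        // input ended with this quoted field
                        yield break;
                    }
                }
                else
                {
                    sb.Append(c);
                }
            }

            // last field is unquoted (or empty after a trailing comma)
            yield return sb.ToString();
        }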

        [STAThread]
        static void Main(string[] args)
        {
            foreach (string str in Parse("first,\"second, second\",\"\"\"third\"\" third\",\"\"\"fourth\"\", fourth\""))
            {
                Console.WriteLine(str);
            }

            Console.WriteLine();
            Console.WriteLine("Press any key to continue...");
            Console.ReadKey();
        }
    }
}

Result

  • first
  • second, second
  • "third" third
  • "fourth", fourth

Thank you for your answers, but before I got to see them I wrote this solution. It's not pretty, but it works for me.

string line = "first,\"second, second\",\"\"\"third\"\" third\",\"\"\"fourth\"\", fourth\"";
var substringArray = new List<string>();
string substring = null;
var doubleQuotesCount = 0;

for (var i = 0; i < line.Length; i++)
{
  if (line[i] == ',' && (doubleQuotesCount % 2) == 0)
  {
    substringArray.Add(substring);
    substring = null;
    doubleQuotesCount = 0;
    continue;
  }
  else
  {
    if (line[i] == '"')
      doubleQuotesCount++;

    substring += line[i];

    //If it is a last character
    if (i == line.Length - 1)
    {
      substringArray.Add(substring);
      substring = null;
      doubleQuotesCount = 0;
    }
  }
}

for(var i = 0; i < substringArray.Count; i++)
{
  if (substringArray[i] != null)
  {
    //remove first double quote
    if (substringArray[i][0] == '"')
    {
      substringArray[i] = substringArray[i].Substring(1);
    }
    //remove last double quote
    if (substringArray[i][substringArray[i].Length - 1] == '"')
    {
      substringArray[i] = substringArray[i].Remove(substringArray[i].Length - 1);
    }
    //Replace double double quotes with single double quote
    substringArray[i] = substringArray[i].Replace("\"\"", "\"");
  }
}
