
Parse comma separated string with a complication in C#

I know how to get substrings from a comma separated string, but here's a complication: what if a substring contains a comma?

If a substring contains a comma, a new line or double quotes, the entire substring is encapsulated in double quotes.

If a substring contains a double quote, that double quote is escaped with another double quote. The worst-case scenario would be something like this:

first,"second, second","""third"" third","""fourth"", fourth"

In this case substrings are:

  • first
  • second, second
  • "third" third
  • "fourth", fourth

second, second is encapsulated in double quotes; I don't want those double quotes in the list/array.

"third" third is encapsulated in double quotes because it contains double quotes, and those are escaped with additional double quotes. Again, I don't want the encapsulating double quotes in the list/array, and I don't want the double quotes that escape other double quotes, but I do want the original double quotes that are part of the substring.
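
For reference, the quoting rules described above can be sketched from the writing side. This is only an illustration of the format as stated in the question: the class `CsvQuoting` and the methods `EscapeField` and `BuildLine` are hypothetical names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CsvQuoting
{
    // Hypothetical helper: applies the quoting rules from the question to one field.
    public static string EscapeField(string field)
    {
        // A field needs quotes if it contains a comma, a double quote or a line break.
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes)
            return field;

        // Double every embedded quote, then wrap the whole field in quotes.
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Escapes each field and joins the results with commas to form one line.
    public static string BuildLine(IEnumerable<string> fields)
    {
        return string.Join(",", fields.Select(f => EscapeField(f)));
    }
}
```

Passing the four substrings listed above to `BuildLine` should reproduce the example line, which is a quick way to check that a parser and the format agree.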

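One way is to use TextFieldParser:

// Needs using System; and using System.IO; plus a reference to the Microsoft.VisualBasic assembly.
using (var reader = new StringReader("first,\"second, second\",\"\"\"third\"\" third\",\"\"\"fourth\"\", fourth\""))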
using (var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader))
{
    parser.Delimiters = new[] { "," };
    parser.HasFieldsEnclosedInQuotes = true;

    while (!parser.EndOfData)
    {
        foreach (var field in parser.ReadFields())
            Console.WriteLine(field);
    }
}

Output:

first
second, second
"third" third
"fourth", fourth

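Try this:

// Caveat: this splits only on the three-character sequence "," so it suits input where every field is quoted.
// With the sample line, the unquoted first field stays attached to the second, and doubled quotes are not unescaped.
string input = "first,\"second, second\",\"\"\"third\"\" third\",\"\"\"fourth\"\", fourth\"";
string[] output = input.Split(new string[] { "\",\"" }, StringSplitOptions.RemoveEmptyEntries);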

I would suggest constructing a small state machine for this problem. You would have states like:

  • Out - before the first field is reached
  • InQuoted - you were Out and " arrived; now you're in and the field is quoted
  • InQuotedMaybeOut - you were InQuoted and " arrived; now you wait for the next character to figure whether it is another " or something else; if else, then select the next valid state (character could be space, new line, comma, so you decide the next state); otherwise, if " arrived, you push " to the output and step back to InQuoted
  • In - after Out, when any character has arrived except , and ", you are automatically inside a new field which is not quoted.

This will certainly read CSV correctly. You can also make the separator configurable, so that you support TSV or semicolon-separated format.

Also keep in mind one very important case in the CSV format: a quoted field may contain a new line! Another special case to keep an eye on is the empty field (like: ,,).
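
The states above could be sketched roughly as follows. This is a minimal illustration that handles a single record, not a full CSV reader: the class `CsvStateMachine`, the method `ParseRecord` and the helper `Flush` are hypothetical names, and treating text after a closing quote as ordinary field content is an assumption made here for leniency.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvStateMachine
{
    private enum State { Out, In, InQuoted, InQuotedMaybeOut }

    // Splits one record into fields; the separator is configurable (e.g. '\t' or ';').
    public static List<string> ParseRecord(string input, char separator = ',')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        var state = State.Out;

        foreach (char c in input)
        {
            switch (state)
            {
                case State.Out:
                    if (c == separator) fields.Add("");              // empty field, as in ",,"
                    else if (c == '"') state = State.InQuoted;
                    else { current.Append(c); state = State.In; }
                    break;

                case State.In:
                    if (c == separator) { Flush(fields, current); state = State.Out; }
                    else current.Append(c);
                    break;

                case State.InQuoted:
                    if (c == '"') state = State.InQuotedMaybeOut;
                    else current.Append(c);                          // separators and new lines are kept
                    break;

                case State.InQuotedMaybeOut:
                    if (c == '"') { current.Append('"'); state = State.InQuoted; }   // escaped quote
                    else if (c == separator) { Flush(fields, current); state = State.Out; }
                    else { current.Append(c); state = State.In; }    // assumption: tolerate text after the closing quote
                    break;
            }
        }

        // The last field has no trailing separator, so it is emitted here.
        fields.Add(current.ToString());
        return fields;
    }

    // Moves the buffered characters into the field list and resets the buffer.
    private static void Flush(List<string> fields, StringBuilder current)
    {
        fields.Add(current.ToString());
        current.Clear();
    }
}
```

Because new lines inside quotes are simply appended, a caller that reads from a file would still need to decide where one record ends; that part is left out of this sketch.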

This is not the most elegant solution but it might help you. I would loop through the characters and do an odd-even count of the quotes. For example you have a bool that is true if you have encountered an odd number of quotes and false for an even number of quotes.

Any comma encountered while this bool value is true should not be considered as a separator. If you know it is a separator you can do several things with that information. Below I replaced the delimiter with something more manageable (not very efficient though):

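bool odd = false;
char replacementDelimiter = '|'; // Or some very unlikely character
char[] chars = str.ToCharArray(); // strings are immutable, so edit a copy

for (int i = 0; i < chars.Length; ++i)
{
   if (chars[i] == '"')
       odd = !odd; // true while we are inside a quoted field
   else if (chars[i] == ',')
   {
      if (!odd)
          chars[i] = replacementDelimiter; // a real separator, outside quotes
   }
}

string[] commaSeparatedTokens = new string(chars).Split(replacementDelimiter);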

At this point you should have an array of strings that are separated on the commas that you have intended. From here on it will be simpler to handle the quotes.
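
As a rough sketch of that last step, each token could be cleaned up with a helper like the one below. The class `CsvTokens` and the method `Unquote` are hypothetical names, and the helper assumes a token is either fully enclosed in quotes or not quoted at all.

```csharp
using System;

public static class CsvTokens
{
    // Hypothetical helper: strips the enclosing quotes from one token
    // and turns each doubled quote back into a single quote.
    public static string Unquote(string token)
    {
        if (token.Length >= 2 && token[0] == '"' && token[token.Length - 1] == '"')
        {
            string inner = token.Substring(1, token.Length - 2);
            return inner.Replace("\"\"", "\"");
        }

        return token; // unquoted tokens are returned unchanged
    }
}
```

Applying `Unquote` to every element of the split array would give the final list of substrings.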

I hope this can help you.

Mini parser

using System;
using System.Collections.Generic;
using System.Text;

namespace ConsoleApp
{
    class Program
    {
        private static IEnumerable<string> Parse(string input)
        {
            if (string.IsNullOrWhiteSpace(input))
            {
                // empty string => nothing to do
                yield break;
            }

            int count = input.Length;
            StringBuilder sb = new StringBuilder();
            int j;

            for (int i = 0; i < count; i++)
            {
                char c = input[i];
                if (c == ',')
                {
                    yield return sb.ToString();
                    sb.Clear();
                }
                else if (c == '"')
                {
                    // begin quoted string
                    sb.Clear();
                    for (j = i + 1; j < count; j++)
                    {
                        if (input[j] == '"')
                        {
                            // quote
                            if (j < count - 1 && input[j + 1] == '"')
                            {
                                // double quote
                                sb.Append('"');
                                j++;
                            }
                            else
                            {
                                break;
                            }
                        }
                        else
                        {
                            sb.Append(input[j]);
                        }
                    }
                    yield return sb.ToString();

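                    // clear buffer and skip to next comma
                    sb.Clear();
                    for (i = j + 1; i < count && input[i] != ','; i++) ;
                    if (i >= count)
                    {
                        // input ended with this quoted field
                        yield break;
                    }
                }
                else
                {
                    sb.Append(c);
                }
            }

            // emit the last field, which has no trailing comma
            yield return sb.ToString();
        }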

        [STAThread]
        static void Main(string[] args)
        {
            foreach (string str in Parse("first,\"second, second\",\"\"\"third\"\" third\",\"\"\"fourth\"\", fourth\""))
            {
                Console.WriteLine(str);
            }

            Console.WriteLine();
            Console.WriteLine("Press any key to continue...");
            Console.ReadKey();
        }
    }
}

Result

  • first
  • second, second
  • "third" third
  • "fourth", fourth

Thank you for your answers. Before I got to see them I wrote this solution; it's not pretty, but it works for me.

string line = "first,\"second, second\",\"\"\"third\"\" third\",\"\"\"fourth\"\", fourth\"";
var substringArray = new List<string>();
string substring = null;
var doubleQuotesCount = 0;

for (var i = 0; i < line.Length; i++)
{
  if (line[i] == ',' && (doubleQuotesCount % 2) == 0)
  {
    substringArray.Add(substring);
    substring = null;
    doubleQuotesCount = 0;
    continue;
  }
  else
  {
    if (line[i] == '"')
      doubleQuotesCount++;

    substring += line[i];

    //If it is a last character
    if (i == line.Length - 1)
    {
      substringArray.Add(substring);
      substring = null;
      doubleQuotesCount = 0;
    }
  }
}

for(var i = 0; i < substringArray.Count; i++)
{
  if (substringArray[i] != null)
  {
    //remove first double quote
    if (substringArray[i][0] == '"')
    {
      substringArray[i] = substringArray[i].Substring(1);
    }
    //remove last double quote
    if (substringArray[i][substringArray[i].Length - 1] == '"')
    {
      substringArray[i] = substringArray[i].Remove(substringArray[i].Length - 1);
    }
    //Replace double double quotes with single double quote
    substringArray[i] = substringArray[i].Replace("\"\"", "\"");
  }
}

