简体   繁体   中英

CSV Parsing

I am trying to use C# to parse CSV. I used regular expressions to find "," and read string if my header counts were equal to my match count.

Now this will not work if I have a value like:

"a",""b","x","y"","c"

then my output is:

'a'
'"b'
'x'
'y"'
'c'

but what I want is:

'a'
'"b","x","y"'
'c'

Is there any regex or any other logic I can use for this ?

CSV, when dealing with things like multi-line, quoted, different delimiters* etc - can get trickier than you might think... perhaps consider a pre-rolled answer? I use this , and it works very well.

*=remember that some locales use [tab] as the C in CSV...

CSV is a great example for code reuse - No matter which one of the csv parsers you choose, don't choose your own. Stop Rolling your own CSV parser

I would use FileHelpers if I were you. Regular Expressions are fine but hard to read, especially if you go back, after a while, for a quick fix.

Just for sake of exercising my mind, quick & dirty working C# procedure:

public static List<string> SplitCSV(string line)
{
    if (string.IsNullOrEmpty(line))
        throw new ArgumentException();

    List<string> result = new List<string>();

    bool inQuote = false;
    StringBuilder val = new StringBuilder();

    // parse line
    foreach (var t in line.Split(','))
    {
        int count = t.Count(c => c == '"');

        if (count > 2 && !inQuote)
        {
            inQuote = true;
            val.Append(t);
            val.Append(',');
            continue;
        }

        if (count > 2 && inQuote)
        {
            inQuote = false;
            val.Append(t);
            result.Add(val.ToString());
            continue;
        }

        if (count == 2 && !inQuote)
        {
            result.Add(t);
            continue;
        }

        if (count == 2 && inQuote)
        {
            val.Append(t);
            val.Append(',');
            continue;
        }
    }

    // remove quotation
    for (int i = 0; i < result.Count; i++)
    {
        string t = result[i];
        result[i] = t.Substring(1, t.Length - 2);
    }

    return result;
}

Try CsvHelper (a library I maintain) or FastCsvReader . Both work well. CsvHelper does writing also. Like everyone else has been saying, don't roll your own. :P

In order to have a parseable CSV file, any double quotes inside a value need to be properly escaped somehow. The two standard ways to do this are by representing a double quote either as two double quotes back to back, or a backslash double quote. That is one of the following two forms:

""

\\"

In the second form your initial string would look like this:

"a","\\"b\\",\\"x\\",\\"y\\"","c"

If your input string is not formatted against some rigorous format like this then you have very little chance of successfully parsing it in an automated environment.

If all your values are guaranteed to be in quotes, look for values, not for commas:

("".*?""|"[^"]*")

This takes advantage of the fact that "the earliest longest match wins" - it looks for double quoted values first, and with a lower priority for normal quoted values.

If you don't want the enclosing quote to be part of the match, use:

"(".*?"|[^"]*)"

and go for the value in match group 1.

As I said: Prerequisite for this to work is well-formed input with guaranteed quotes or double quotes around each value. Empty values must be quoted as well! A nice side-effect is that it does not care for the separator char. Commas, TABs, semi-colons, spaces, you name it. All will work.

There's an oft quoted saying:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (Jamie Zawinski)

Given that there's no official standard for CSV files (instead there are a large number of slightly incompatible styles), you need to make sure that what you implement suits the files you will be receiving. No point in implementing anything fancier than what you need - and I'm pretty sure you don't need Regular Expressions.

Here's my stab at a simple method to extract the terms - basically, it loops through the line looking for commas, keeping track of whether the current index is within a string or not:

    public IEnumerable<string> SplitCSV(string line)
    {
        int index = 0;
        int start = 0;
        bool inString = false;

        foreach (char c in line)
        {
            switch (c)
            {
                case '"':
                    inString = !inString;
                    break;

                case ',':
                    if (!inString)
                    {
                        yield return line.Substring(start, index - start);
                        start = index + 1;
                    }
                    break;
            }
            index++;
        }

        if (start < index)
            yield return line.Substring(start, index - start);
    }

Standard caveat - untested code, there may be off-by-one errors.

Limitations

  • The quotes around a value aren't removed automatically.
    To do this, add a check just before the yield return statement near the end.

  • Single quotes aren't supported in the same way as double quotes
    You could add a separate boolean inSingleQuotedString , renaming the existing boolean to inDoubleQuotedString and treating both the same way. (You can't make the existing boolean do double work because you need the string to end with the same quote that started it.)

  • Whitespace isn't automatically removed
    Some tools introduce whitespace around the commas in CSV files to "pretty" the file; it then becomes difficult to tell intentional whitespace from formatting whitespace.

FileHelpers supports multiline fields.

You could parse files like these:

a,"line 1
line 2
line 3"
b,"line 1
line 2
line 3"

Here is the datatype declaration:

[DelimitedRecord(",")]
public class MyRecord
{ 
 public string field1;
 [FieldQuoted('"', QuoteMode.OptionalForRead, MultilineMode.AllowForRead)]
 public string field2;
}

Here is the usage:

static void Main()
{
 FileHelperEngine engine = new FileHelperEngine(typeof(MyRecord));
 MyRecord[] res = engine.ReadFile("file.csv");       
}

Well, I'm no regex wiz, but I'm certain they have an answer for this.

Procedurally it's going through letter by letter. Set a variable, say dontMatch, to FALSE.

Each time you run into a quote toggle dontMatch.

each time you run into a comma, check dontMatch. If it's TRUE, ignore the comma. If it's FALSE, split at the comma.

This works for the example you give, but the logic you use for quotation marks is fundamentally faulty - you must escape them or use another delimiter (single quotes, for instance) to set major quotations apart from minor quotations.

For instance,

"a", ""b", ""c", "d"", "e""

will yield bad results.

This can be fixed with another patch. Rather than simply keeping a true false you have to match quotes.

To match quotes you have to know what was last seen, which gets into pretty deep parsing territory. You'll probably, at that point, want to make sure your language is designed well, and if it is you can use a compiler tool to create a parser for you.

-Adam

.Net的FileHelpers是你的朋友。

See the link "Regex fun with CSV" at:

http://snippets.dzone.com/posts/show/4430

The Lumenworks CSV parser (open source, free but needs a codeproject login) is by far the best one I've used. It'll save you having to write the regex and is intuitive to use.

I have just try your regular expression in my code..its work fine for formated text with quote ...

but wondering if we can parse below value by Regex..

"First_Bat7679",""NAME","ENAME","FILE"","","","From: "DDD,_Ala%as"@sib.com"

I am looking for result as:

'First_Bat7679'
'"NAME","ENAME","FILE"'
''
''
'From: "DDD,_Ala%as"@sib.com'

Thanx

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM