简体   繁体   中英

Adding a Way To preserve A Comma In A CSV To DataTable Function

I have a function that converts a .csv file to a datatable. One of the columns I am converting is is a field of names that have a comma in them ie "Doe, John" when converting the function treats this as 2 seperate fields because of the comma. I need the datatable to hold this as one field Doe, John in the datatable.

Function CSV2DataTable(ByVal filename As String, ByVal sepChar As String) As DataTable
    Dim reader As System.IO.StreamReader
    Dim table As New DataTable
    Dim colAdded As Boolean = False

    Try
        ''# open a reader for the input file, and read line by line
        reader = New System.IO.StreamReader(filename)
        Do While reader.Peek() >= 0
            ''# read a line and split it into tokens, divided by the specified 
            ''# separators
            Dim tokens As String() = System.Text.RegularExpressions.Regex.Split _
                (reader.ReadLine(), sepChar)
            ''# add the columns if this is the first line
            If Not colAdded Then
                For Each token As String In tokens
                    table.Columns.Add(token)
                Next
                colAdded = True
            Else
                ''# create a new empty row
                Dim row As DataRow = table.NewRow()
                ''# fill the new row with the token extracted from the current 
                ''# line
                For i As Integer = 0 To table.Columns.Count - 1
                    row(i) = tokens(i)
                Next
                ''# add the row to the DataTable
                table.Rows.Add(row)
            End If
        Loop

        Return table
    Finally
        If Not reader Is Nothing Then reader.Close()
    End Try
End Function

Instead of rolling out your own solution have you considered using

http://www.filehelpers.net/

It should address your issue.

Don't use a .Split() function to read your csv data. Not only does it cause the kind of error you just ran into but it's slower as well. You need a state machine -based parser. That will be faster and make it easier to correctly handle quote-enclosed text.

I have an example here:
Reading CSV files in C#

and there's also a highly-respected CSV reader on codeplex you can use:
http://www.codeproject.com/KB/database/CsvReader.aspx


You'd use my code like this:

Function DataTableFromCSV(ByVal filename As String) As DataTable
    Dim table As New DataTable
    Dim colAdded As Boolean = False

    For Each record As IList(Of String) In CSV.FromFile(filename)
        ''# Add column headers on first iteration
        If Not colAdded Then
            For Each token As String In record
                table.Columns.Add(token)
            Next token
            colAdded = True
        Else
            ''# add the row to the table
            Dim row As DataRow = table.NewRow()
            For i As Integer = 0 To table.Columns.Count - 1
                row(i) = record(i)
            Next
            table.Rows.Add(row)
        End If
    Next record

    Return table
End Function  

If you're using .net 3.5 or later, I'd write it a little differently to pull the column creation out of the for each loop (using type inference and .Take(1) ), but I wanted to be sure this would work with .Net 2.0 as well.

I can't help you with the VB.NET side of things, but RFC 4180 is your friend. Specifically, section 2:

5. Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example:

  "aaa","bbb","ccc" CRLF zzz,yyy,xxx 

6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:

  "aaa","b CRLF bb","ccc" CRLF zzz,yyy,xxx 

7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:

  "aaa","b""bb","ccc" 

Try splitting with the " and then skipping every second element starting from the first. example:

 "Test,Name","123 Street,","NY","12345" 
Dim tokens As String() =  = System.Text.RegularExpressions.Regex.Split _
            (reader.ReadLine(), """") 'Or send in " as the sepchar

You would get

 {Length=9}
    (0): ""
    (1): "Test,Name"
    (2): ","
    (3): "123 Street,"
    (4): ","
    (5): "NY"
    (6): ","
    (7): "12345"
    (8): ""

So you would take the Odd numbered elements only to retrieve the data. The only caveat is when there is also a " in the data file.

I still think you should reconsider not using an external library.

Here is a article that addresses it. http://www.secretgeek.net/csv_trouble.asp

Like others have said, don't roll your own.

Give CSVHelper a try (a library I maintain). It's separated into a parser and reader, so you can just use the parser if you want. The parsing code is pretty straight forward and RFC 4180 compliant, if you want to look at it's source.

Have you looked into using the TextFieldParser class that's built into the .Net framework?

It has a property called HasFieldsEnclosedInQuotes that should handle your situation.

You can set the delimiters, and then call the ReadLine and ReadFields methods to get the field data, and it should account for those fields enclosed in quotation marks.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM