
Scala: Auto detection of delimiter/separator in CSV file

I'm using the OpenCSV library to split my CSV files. Now I need to detect the delimiter/separator character with absolute certainty. I have searched the net, but I only found examples where you create a list of candidates and try each of them. I do not think that is the best way, because it is likely to produce errors. My splitter should work properly on any CSV (over which I have no control), so it has to be as generic as possible. Does anyone have a good solution?

You may have already seen this related SO question, which lists good strategies, like counting the number of times a potential delimiter appears, and/or verifying that each row has the same number of columns when using a hypothetical delimiter.
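The column-consistency strategy can be sketched in Python roughly as follows. This is only an illustration under assumptions: the candidate list and the function names score_delimiter and guess_delimiter are made up for this example, and the scoring rule (share of rows that agree on the most common column count) is one reasonable choice among several.

```python
import csv
import io
from collections import Counter

# Hypothetical candidate list; adjust it for your data.
CANDIDATES = [",", ";", "\t", "|"]

def score_delimiter(sample, delimiter):
    """Return (consistency, columns) for one hypothetical delimiter.

    consistency is the fraction of non-empty rows that share the most
    common column count; columns is that column count.
    """
    reader = csv.reader(io.StringIO(sample, newline=""), delimiter=delimiter)
    rows = [row for row in reader if row]
    if not rows:
        return 0.0, 0
    widths = Counter(len(row) for row in rows)
    columns, hits = widths.most_common(1)[0]
    return hits / len(rows), columns

def guess_delimiter(sample, candidates=CANDIDATES):
    """Pick the candidate with the most consistent multi-column rows, or None."""
    best, best_key = None, (0.0, 0)
    for delim in candidates:
        consistency, columns = score_delimiter(sample, delim)
        if columns < 2:
            continue  # this candidate never splits a typical row
        if (consistency, columns) > best_key:
            best, best_key = delim, (consistency, columns)
    return best
```

Because the rows are parsed with csv.reader, a comma inside a quoted field does not count as a column break when the semicolon is being tested. A result of None means no candidate produced more than one column.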

Unfortunately, absolute certainty is impossible because the format doesn't include a way to specify the delimiter unambiguously within the file. I think the best solution for making it as generic as possible would be to make the user specify the delimiter when it isn't a comma (which is how opencsv handles it), or perhaps to allow a client to specify the delimiter if you or they determine that automatic detection failed. If this can't be interactive, then I think the best you can do is log the cases where you think it failed so that they can be dealt with later.
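A minimal Python sketch of that fallback order (explicit override first, then automatic detection, then a logged default) might look like this. Everything here is hypothetical: resolve_delimiter, the logger name, and the detect callback, which stands for whatever detection routine you use and is assumed to return None when it cannot decide.

```python
import logging

logger = logging.getLogger("csv_delimiter")  # hypothetical logger name

def resolve_delimiter(path, sample, detect, override=None, default=","):
    """Choose a delimiter: caller override, then detection, then a logged default."""
    if override is not None:
        return override  # the caller knows better than any heuristic
    guess = detect(sample)  # assumed to return None on failure
    if guess is None:
        # Non-interactive case: record the file so it can be reviewed later.
        logger.warning("%s: delimiter detection failed, using %r", path, default)
        return default
    return guess
```

Logging the file path rather than raising keeps a batch job running while still leaving a list of files to check by hand.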

Also, I think the error rate will be lower than you're expecting. My guess is that 99% of the time the delimiter will be a comma, semicolon, period, or tab. I've unfortunately seen lazy coders use something like a caret, pipe, or tilde to delimit fields under the assumption that the data won't contain it, so they won't have to do proper escaping. But this isn't the norm, and it shouldn't be considered CSV.

The Python csv module has a Sniffer class which guesses delimiters (the user can optionally supply a list of candidates); you may want to look at its implementation.
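For reference, a short sketch of calling the Sniffer. The wrapper name sniff_delimiter is made up for this example; csv.Sniffer().sniff(sample, delimiters=...) is the standard-library call, and it raises csv.Error when it cannot settle on a delimiter.

```python
import csv

def sniff_delimiter(sample, candidates=None):
    """Return the delimiter guessed by csv.Sniffer, or None if it cannot decide.

    candidates is an optional string of allowed delimiter characters,
    for example ",;\t|"; None lets the Sniffer consider any character.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return None

sample = "name;age;city\nAnn;31;Oslo\nBob;45;Rome\nCy;27;Lima\n"
print(repr(sniff_delimiter(sample, candidates=",;\t|")))
```

In practice the sample would be the first few kilobytes of the file rather than the whole thing.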

I've recently been toying with the problem of separator/delimiter detection of CSV files. I've come up with the following which I hope will help others and perhaps receive feedback to improve upon.

My solution is based on several articles I've read on the problem. Because there are no restrictions on what a field delimiter can be, I decided to use the ASCII table and eliminate the obvious (alphanumeric chars) and the not so obvious (non-printable chars), with the exception of the TAB code. Using these values I populated a dictionary, with the ASCII code as the key and a counter as the value, which my code fills in.

Then it was a matter of reading the CSV line by line, looking through each line for an occurrence of any of the dictionary key characters and incrementing the value of each one I came across. The loop continues to the end of the file or for a limit of 100 times in this example. You can change this as you see fit, but 100 is more than enough to detect the delimiter. The delimiter is then determined by the dictionary key (ASCII code) with the greatest value.

Calling routine example

Private Sub Main()
    Dim separator As Char
    separator = separatorDetect(txtInputFile.Text)
End Sub

Main detection function

' Requires Imports System.IO and System.Linq at the top of the file.
Private Function separatorDetect(ByVal StrFileName As String) As Char
    ' Candidate separators keyed by ASCII code; each value is a running count.
    Dim dictSeparators As New Dictionary(Of Integer, Integer)
    dictSeparators.Add(9, 0)        ' TAB
    dictSeparators.Add(33, 0)       ' ! (34, the double quote, is skipped)
    For i As Integer = 35 To 47     ' # through /
        dictSeparators.Add(i, 0)
    Next
    For i As Integer = 58 To 64     ' : through @, which includes the semicolon
        dictSeparators.Add(i, 0)
    Next
    For i As Integer = 91 To 96     ' [ through `
        dictSeparators.Add(i, 0)
    Next
    For i As Integer = 123 To 126   ' { through ~
        dictSeparators.Add(i, 0)
    Next
    ' Copy the keys so the dictionary can be updated while looping over them.
    Dim keyList As New List(Of Integer)(dictSeparators.Keys)
    Dim lineCounter As Integer = 0
    Using textReader = New StreamReader(StrFileName)
        ' Sample at most the first 100 lines.
        Do Until textReader.EndOfStream OrElse lineCounter >= 100
            ' The line is not trimmed, so leading and trailing tabs still count.
            Dim line As String = textReader.ReadLine()
            For Each key In keyList
                dictSeparators.Item(key) = dictSeparators.Item(key) + InStrCount(line, Convert.ToChar(key))
            Next
            lineCounter += 1
        Loop
    End Using
    ' The candidate with the highest total count wins.
    Dim max = dictSeparators.Aggregate(Function(l, r) If(l.Value > r.Value, l, r)).Key
    Return Chr(max)
End Function

Recursive indexof count function

' Counts occurrences of SearchString in SourceString, starting at StartPos.
Private Function InStrCount(ByVal SourceString As String, ByVal SearchString As Char, _
                Optional ByRef StartPos As Integer = 0, _
                Optional ByRef Count As Integer = 0) As Integer
    ' Look up the next match once and reuse the position.
    Dim pos As Integer = SourceString.IndexOf(SearchString, StartPos)
    If pos > -1 Then
        Count += 1
        ' Recurse on the remainder of the string; Count is shared ByRef.
        InStrCount(SourceString, SearchString, pos + 1, Count)
    End If
    Return Count
End Function

This works for me, but I'm always happy to be shown a better, more optimised way.
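For comparison, the same counting idea can be sketched in a few lines of Python. This is a rough port under assumptions, not a drop-in replacement: the name detect_by_count is made up, the candidate set is TAB plus Python's string.punctuation without the double quote, and unlike the VB version it returns None when no candidate appears at all. Like the original, it does not skip delimiters that occur inside quoted fields.

```python
import string
from itertools import islice

# Candidates: TAB plus printable ASCII punctuation, minus the double quote.
CANDIDATES = "\t" + string.punctuation.replace('"', "")

def detect_by_count(lines, max_lines=100):
    """Return the candidate character seen most often in the first max_lines lines."""
    counts = dict.fromkeys(CANDIDATES, 0)
    for line in islice(lines, max_lines):
        for ch in CANDIDATES:
            counts[ch] += line.count(ch)
    best = max(counts, key=counts.get)
    # No candidate occurred at all: report failure instead of guessing.
    return best if counts[best] > 0 else None
```

Any iterable of strings works as input, including an open text file, for example detect_by_count(open(path, newline="")).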

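In How to determine the delimiter in a CSV file, I mentioned Univocity-Parsers, which appears to be a well-maintained and popular library that actually provides an API that handles the detection for you.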
