I had to make an application that imports a CSV file into a database table. The CSV files have roughly 500 rows and 30 columns and come from a not-very-reliable source (they might contain corrupted data).
I did it like this: CSV -> DataTable -> SqlBulkCopy.WriteToServer(DataTable). It processes 500 records to a non-local SQL server in about 4 seconds, which is not a big problem. But since the CSV may contain corrupted data (wrong date format, integer overflow, etc.), I had to make it error-proof: import the good rows and skip the bad ones. The problem does not occur when loading the corrupted data into the DataTable, but when importing the DataTable into the database. What I did was add each row to the DB one by one inside a try/catch, like this:
int Row = 0;
// csvFileData is the DataTable filled with CSV data
foreach (DataRow RowToAdd in csvFileData.Rows)
{
    // only 1-2% of the time is spent here (the problem is not the row-by-row DataTable processing)
    Row++;
    DataTable TempDT = csvFileData.Clone();
    TempDT.ImportRow(RowToAdd);
    try
    {
        // 98% of the time is spent here
        s.WriteToServer(TempDT);
    }
    catch (Exception importex)
    {
        Console.WriteLine("Couldn't import row {0}, reason: {1}", Row, importex.Message);
    }
}
Calling s.WriteToServer(csvFileData); just once is not an option in my case.
And this works really well, except that the execution time rose to 15 seconds, which is a lot, because there is a round trip to the DB for every single row. How can I fix this? I was thinking about emulating a local clone of the database table design, try/catch every row against it, exclude the bad ones, and then import the entire DataTable (with the bad rows removed) at once. Or doing some async import row by row, but I think the rows might get scrambled out of order, go missing, or even be duplicated. Can someone give me a tip?
A bulk insert of one row is >10 times slower than a single row insert. Your current strategy does not work.
Validate and cleanse the data on the client so that it is guaranteed the inserts will succeed. Copy the clean data into a DataTable and insert it all at once, or at least in large batches (the performance gains start to appear around 100 to 1000 rows per batch).
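A rough sketch of that client-side validation, in the spirit of the advice above. The column names ("OrderDate", "Quantity") and their target types are made up for illustration; the TryParse calls must mirror whatever types the real destination table uses:

```csharp
// Hypothetical example: validate each row's fields with TryParse
// (matching the SQL column types) before bulk-copying, so the
// whole table can be written to the server in one round trip.
DataTable goodRows = csvFileData.Clone();
foreach (DataRow row in csvFileData.Rows)
{
    DateTime d;
    int n;
    bool ok = DateTime.TryParse(row["OrderDate"].ToString(), out d)   // "OrderDate" is a placeholder name
           && int.TryParse(row["Quantity"].ToString(), out n);        // "Quantity" is a placeholder name
    if (ok)
        goodRows.ImportRow(row);                 // keep the valid row
    else
        Console.WriteLine("Skipping bad row");   // log and drop the corrupt row
}
s.WriteToServer(goodRows);                       // one bulk copy, one round trip
```

This keeps the single-WriteToServer performance from the question while still skipping corrupt rows, at the cost of re-stating the type rules on the client.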
The obvious solution, as mentioned, is to verify the data as it is read from the CSV file and fill the DataTable only with the "good" rows.
If your verification includes datatype checking, i.e. testing whether a string is convertible by the target system (here: SQL Server), you end up duplicating logic: you reimplement parsing/conversion logic that is already built into SQL Server. That is not a big problem, but from a design standpoint it is not necessarily smart.
In fact, you can import a CSV file directly into SQL Server using the BULK INSERT command.
So another approach is to bulk-import the raw data into a staging table on the server and do the datatype checking there. This is very easy if you happen to run SQL Server 2005 or above, which introduced functions like ISDATE and ISNUMERIC.
BULK INSERT CSVRawData FROM 'c:\csvtest.txt' WITH (
    FIELDTERMINATOR = ',', ROWTERMINATOR = '\n'
)

INSERT INTO FinalTable
SELECT * FROM CSVRawData
WHERE ISDATE(DateField) = 1
  AND ISNUMERIC(NumericField) = 1
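Driven from the C# side, the whole staging-table round trip can be sent as a single batch. This is only a sketch: the table names, file path, and column names come from the T-SQL above, while the connection string is a placeholder you would replace with your own:

```csharp
// Sketch: run the BULK INSERT plus the filtered INSERT as one
// server-side batch, so the client makes a single round trip.
// CSVRawData, FinalTable, and the file path are from the example above;
// the connection string is a placeholder.
using (var conn = new SqlConnection("Server=...;Database=...;Integrated Security=true"))
{
    conn.Open();
    var cmd = new SqlCommand(@"
        TRUNCATE TABLE CSVRawData;
        BULK INSERT CSVRawData FROM 'c:\csvtest.txt'
            WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');
        INSERT INTO FinalTable
        SELECT * FROM CSVRawData
        WHERE ISDATE(DateField) = 1 AND ISNUMERIC(NumericField) = 1;", conn);
    cmd.ExecuteNonQuery();   // bad rows simply stay behind in the staging table
}
```

A nice side effect is that the rejected rows remain in CSVRawData afterwards, so you can inspect or log exactly what was skipped.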
I would personally go this way when the files are large or the data is not trustworthy. E.g., we analyze log files that way: they contain 50M+ rows, and some of those rows are corrupted or simply not of interest.