I had to make an application that imports a CSV file into a database table. The CSV files have roughly 500 rows and 30 columns and come from a not-very-reliable source (they might contain corrupted data).
I did it like this: CSV -> DataTable -> SqlBulkCopy.WriteToServer(DataTable). It processes 500 records to a non-local SQL server in about 4 seconds, which is not a big problem. But since the CSV may contain corrupted data (wrong date format, integer overflow, etc.), I had to make it error-proof: import the good rows and skip the bad ones. The problem does not occur when loading the corrupted data into the DataTable, but when importing the DataTable into the database. What I did was add each row to the DB one by one inside a try/catch, like this:
int Row = 0;
// csvFileData is the DataTable filled with CSV data
foreach (DataRow RowToAdd in csvFileData.Rows)
{
    // only 1-2% of the time is spent here (the problem is not the row-by-row DataTable processing)
    Row++;
    DataTable TempDT = csvFileData.Clone();
    TempDT.ImportRow(RowToAdd);
    try
    {
        // 98% of the time is spent here
        s.WriteToServer(TempDT);
    }
    catch (Exception importex)
    {
        Console.WriteLine("Couldn't import row {0}, reason: {1}", Row, importex.Message);
    }
}
Calling s.WriteToServer(csvFileData); just once is not an option in my case.
And this works really well, except that the execution time rose to 15 seconds, which is a lot, because there is a round trip to the DB for every single row. How can I fix this? I was thinking about emulating a local clone of the database table design, try/catch every row against it, exclude the bad ones, and then import the entire DataTable (with the bad rows removed) at once. Or doing some async import row by row, but I think the rows might get scrambled out of order, go missing, or even be duplicated. Can someone give me a tip?
A bulk insert of one row is >10 times slower than a single row insert. Your current strategy does not work.
Validate and cleanse the data on the client so that it is guaranteed the inserts will succeed. Copy the clean data into a DataTable and insert it all at once, or at least in large batches (the performance gains start to appear around 100 to 1000 rows per batch).
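A rough sketch of that client-side validation, in the spirit of the advice above. The column names ("OrderDate", "Quantity") and their target types are made up for illustration; the TryParse calls must mirror whatever types the real destination table uses:

```csharp
// Hypothetical example: validate each row's fields with TryParse
// (matching the SQL column types) before bulk-copying, so the
// whole table can be written to the server in one round trip.
DataTable goodRows = csvFileData.Clone();
foreach (DataRow row in csvFileData.Rows)
{
    DateTime d;
    int n;
    bool ok = DateTime.TryParse(row["OrderDate"].ToString(), out d)   // "OrderDate" is a placeholder name
           && int.TryParse(row["Quantity"].ToString(), out n);        // "Quantity" is a placeholder name
    if (ok)
        goodRows.ImportRow(row);                 // keep the valid row
    else
        Console.WriteLine("Skipping bad row");   // log and drop the corrupt row
}
s.WriteToServer(goodRows);                       // one bulk copy, one round trip
```

This keeps the single-WriteToServer performance from the question while still skipping corrupt rows, at the cost of re-stating the type rules on the client.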
The obvious solution, as mentioned, is to verify the data as it is read from the CSV file and fill the DataTable only with the "good" rows.
If your verification includes datatype checking, i.e. testing whether a string is convertible by the target system (here: SQL Server), you end up duplicating logic: you reimplement parsing/conversion logic that is already built into SQL Server. That is not a big problem, but from a design standpoint it is not necessarily smart.
In fact, you can import a CSV file directly into SQL Server using the BULK INSERT command.
So another approach is to bulk-import the raw data into a staging table on the server and do the datatype checking there. This is very easy if you happen to run SQL Server 2005 or above, which introduced functions like ISDATE and ISNUMERIC.
BULK INSERT CSVRawData FROM 'c:\csvtest.txt' WITH (
    FIELDTERMINATOR = ',', ROWTERMINATOR = '\n'
)

INSERT INTO FinalTable
SELECT * FROM CSVRawData
WHERE ISDATE(DateField) = 1
  AND ISNUMERIC(NumericField) = 1
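Driven from the C# side, the whole staging-table round trip can be sent as a single batch. This is only a sketch: the table names, file path, and column names come from the T-SQL above, while the connection string is a placeholder you would replace with your own:

```csharp
// Sketch: run the BULK INSERT plus the filtered INSERT as one
// server-side batch, so the client makes a single round trip.
// CSVRawData, FinalTable, and the file path are from the example above;
// the connection string is a placeholder.
using (var conn = new SqlConnection("Server=...;Database=...;Integrated Security=true"))
{
    conn.Open();
    var cmd = new SqlCommand(@"
        TRUNCATE TABLE CSVRawData;
        BULK INSERT CSVRawData FROM 'c:\csvtest.txt'
            WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');
        INSERT INTO FinalTable
        SELECT * FROM CSVRawData
        WHERE ISDATE(DateField) = 1 AND ISNUMERIC(NumericField) = 1;", conn);
    cmd.ExecuteNonQuery();   // bad rows simply stay behind in the staging table
}
```

A nice side effect is that the rejected rows remain in CSVRawData afterwards, so you can inspect or log exactly what was skipped.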
I would personally go this way when the files are large or the data is not trustworthy. E.g., we analyze log files that way: they contain 50M+ rows, and some of those rows are corrupted or simply not of interest.