SqlBulkCopy and DataTables with Parent/Child Relation on Identity Column

We have a need to update several tables that have parent/child relationships based on an Identity primary key in the parent table, which is referred to by one or more child tables as a foreign key.

  • Due to the high volume of data, we would like to build these tables in memory, then use SqlBulkCopy from C# to update the database en masse from either the DataSet or the individual DataTables.
  • We would further like to do this in parallel, from multiple threads, processes, and possibly clients.

Our prototype in F# shows a lot of promise, with a 34x performance increase, but this code forces known Identity values in the parent table. When they are not forced, the Identity column does get correctly generated in the database when SqlBulkCopy inserts the rows, but the Identity values do NOT get updated in the in-memory DataTable. Further, even if they were, it is not clear whether the DataSet would correctly fix up the parent/child relationships, so that the child tables could subsequently be written with correct foreign key values.

Can anyone explain how to have SqlBulkCopy update Identity values, and further how to configure a DataSet so as to retain and update parent/child relationships, if this is not done automatically when FillSchema is called on a DataAdapter for the individual DataTables?
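
For reference, the in-memory shape is roughly the following (a simplified sketch, not the actual prototype; in the sketch the parent key is auto-incremented in memory with negative provisional values rather than forced):

using System.Data;

var ds = new DataSet();

var parent = ds.Tables.Add("parent");
var parentId = parent.Columns.Add("id", typeof(int));
parentId.AutoIncrement     = true;
parentId.AutoIncrementSeed = -1;   // provisional negative keys assigned in memory,
parentId.AutoIncrementStep = -1;   // so they can never collide with real database values
parent.Columns.Add("data", typeof(string));
parent.PrimaryKey = new[] { parentId };

var child = ds.Tables.Add("child");
child.Columns.Add("id", typeof(int));
child.Columns.Add("data", typeof(string));
child.Columns.Add("parent_id", typeof(int));

// the relation's ForeignKeyConstraint cascades in-memory changes to parent.id
// down to child.parent_id, but nothing here pulls the server-generated values back
ds.Relations.Add("parent_child", parent.Columns["id"], child.Columns["parent_id"]);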

Answers that I'm not looking for:

  • Read the database to find the current highest Identity value, then manually increment it when creating each parent row. This does not work for multiple processes/clients, and as I understand it, failed transactions may cause some Identity values to be skipped, so this method could screw up the relation.
  • Write the parent rows one at a time and ask for the Identity value back. This defeats at least some of the gains from using SqlBulkCopy (yes, there are a lot more child rows than parent rows, but there are still a lot of parent rows).

Similar to the following unanswered question:

First of all: it is not possible to do what you want with SqlBulkCopy. As the name suggests, it is just a "one-way street": it moves data into SQL Server as quickly as possible. It is the .NET version of the old bulk copy command, which imports raw text files into tables. So there is no way to get the identity values back if you are using SqlBulkCopy.
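
To illustrate the point, the entire write is one void call, so there is simply no channel through which generated identity values could flow back (a minimal sketch; connectionString and parentTable are assumed to already exist):

using System.Data.SqlClient;

using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.parent";
    bulk.ColumnMappings.Add("data", "data");   // map only the non-identity columns
    bulk.WriteToServer(parentTable);           // void: the server assigns ids, nothing is returned
}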

I have done a lot of bulk data processing and have faced this problem several times. The solution depends on your architecture and data distribution. Here are some ideas:

  • Create one set of target tables for each thread and import into these tables. At the end, merge these tables. Most of this can be implemented in a quite generic way, where you generate tables called TABLENAME_THREAD_ID automatically from tables called TABLENAME.

  • Move ID generation completely out of the database. For example, implement a central web service which generates the IDs. In that case you should not generate one ID per call, but rather generate ID ranges; otherwise network overhead usually becomes a bottleneck (see the sketch after this list).

  • Try to generate IDs out of your data. If that is possible, your problem goes away. Don't say "it's not possible" too fast. Perhaps you can use string IDs which can be cleaned up in a post-processing step?
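
As a rough illustration of the ID-range idea from the second bullet (a sketch only: dbo.id_block and ReserveIdBlock are hypothetical names, not from the original post; it assumes a one-row-per-table counter such as create table dbo.id_block (name varchar(128) primary key, next_id bigint not null)):

using System.Data.SqlClient;

static long ReserveIdBlock(string connectionString, string name, int blockSize)
{
    // atomically advance the counter and return the first ID of the reserved block
    const string sql = @"
        update dbo.id_block
           set next_id = next_id + @size
        output deleted.next_id
         where name = @name;";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        cmd.Parameters.AddWithValue("@name", name);
        cmd.Parameters.AddWithValue("@size", blockSize);
        conn.Open();
        return (long)cmd.ExecuteScalar();
    }
}

// each thread/process reserves, say, 10,000 parent IDs at a time and hands them
// out locally, so only one round trip is paid per block rather than per row
long firstParentId = ReserveIdBlock(connectionString, "parent", 10000);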

And one more remark: an increase by a factor of 34 when using BulkCopy sounds too small in my opinion. If you want to insert data fast, make sure that your database is configured correctly.

Read this article. I think this is exactly what you are looking for and more. Very nice and elegant solution.

http://www.codinghelmet.com/?path=howto/bulk-insert

The only way you could do what you want by using SqlBulkCopy is to first insert the data into a staging table. Then use a stored procedure to distribute the data to the destination tables. Yes, this will cause a slowdown, but it will still be fast.
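
Client-side, that approach might look roughly like this (a sketch; dbo.MergeStagedParentChild is a hypothetical procedure name, and parentTable/childTable are assumed in-memory DataTables - the procedure body would do the ID remapping, much like the T-SQL in the answer below):

using System.Data;
using System.Data.SqlClient;

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (var tx = conn.BeginTransaction())
    {
        using (var bulk = new SqlBulkCopy(conn, SqlBulkCopyOptions.Default, tx))
        {
            bulk.DestinationTableName = "stage.parent";
            bulk.WriteToServer(parentTable);

            bulk.DestinationTableName = "stage.child";
            bulk.WriteToServer(childTable);
        }

        // the procedure maps staged keys to real identity values and inserts
        // the remapped rows into the destination tables
        using (var cmd = new SqlCommand("dbo.MergeStagedParentChild", conn, tx))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.CommandTimeout = 0;   // large batches can take a while
            cmd.ExecuteNonQuery();
        }

        tx.Commit();
    }
}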

You might also consider redesigning your data, i.e. splitting it up, denormalizing it, etc.

set identity_insert <table> on and dbcc checkident are your friends here. This is something like what I've done in the past (see code sample). The only real caveat is that the update process is the only one that can be inserting data: everybody else has to get out of the pool while the update is going on. You could, of course, do this sort of mapping programmatically prior to loading the production tables. But the same restriction on the inserts applies: the update process is the only process that gets to play.

--
-- start with a source schema -- doesn't actually need to be SQL tables
-- but from the standpoint of demonstration, it makes it easier
--
create table source.parent
(
  id   int         not null primary key ,
  data varchar(32) not null ,
)
create table source.child
(
  id        int         not null primary key ,
  data      varchar(32) not null ,
  parent_id int         not null foreign key references source.parent(id) ,
)

--
-- On the receiving end, you need to create staging tables.
-- You'll notice that while there are primary keys defined,
-- there are no foreign key constraints. Depending on the
-- cleanliness of your data, you might even get rid of the
-- primary key definitions (though you'll need to add
-- some sort of processing to clean the data one way or
-- another, obviously).
--
-- and, depending on context, these could even be temp tables
--
create table stage.parent
(
  id   int         not null primary key ,
  data varchar(32) not null ,
)

create table stage.child
(
  id        int         not null primary key ,
  data      varchar(32) not null ,
  parent_id int         not null ,
)

--
-- and of course, the final destination tables already exist,
-- complete with identity properties, etc.
--
create table dbo.parent
(
  id int not null identity(1,1) primary key ,
  data varchar(32) not null ,
)
create table dbo.child
(
  id int not null identity(1,1) primary key ,
  data varchar(32) not null ,
  parent_id int not null foreign key references dbo.parent(id) ,
)

-----------------------------------------------------------------------
-- so, you BCP or otherwise load your staging tables with the new data
-- from the source tables. How this happens is left as an exercise for
-- the reader. We'll just assume that some sort of magic happens to
-- make it so. Don't forget to truncate the staging tables prior to
-- loading them with data.
-----------------------------------------------------------------------

-------------------------------------------------------------------------
-- Now we get to work to populate the production tables with the new data
--
-- First we need a map to let us create the new identity values.
-------------------------------------------------------------------------
if object_id('tempdb..#parent_map') is not null drop table #parent_map
if object_id('tempdb..#child_map')  is not null drop table #child_map
create table #parent_map
(
  old_id int not null primary key nonclustered       ,
  offset int not null identity(1,1) unique clustered ,
  new_id int     null ,  
)
create table #child_map
(
  old_id int not null primary key nonclustered ,
  offset int not null identity(1,1) unique clustered ,
  new_id int     null ,
)

insert #parent_map ( old_id ) select id from stage.parent
insert #child_map  ( old_id ) select id from stage.child

-------------------------------------------------------------------------------
-- now that we've got the map, we can blast the data into the production tables
-------------------------------------------------------------------------------

--
-- compute the new ID values
--
update #parent_map set new_id = offset + isnull( ( select max(id) from dbo.parent ) , 0 )  -- isnull covers an empty destination table

--
-- blast it into the parent table, turning on identity_insert
--
set identity_insert dbo.parent on

insert dbo.parent (id,data)
select id   = map.new_id   ,
       data = staging.data
from stage.parent staging
join #parent_map  map     on map.old_id = staging.id

set identity_insert dbo.parent off

--
-- reseed the identity properties high water mark
--
dbcc checkident ( 'dbo.parent' , reseed )


--
-- compute the new ID values
--
update #child_map set new_id = offset + isnull( ( select max(id) from dbo.child ) , 0 )  -- isnull covers an empty destination table

--
-- blast it into the child table, turning on identity_insert
--
set identity_insert dbo.child on

insert dbo.child ( id , data , parent_id )
select id        = map.new_id      ,
       data      = staging.data    ,
       parent_id = parent.new_id

from stage.child staging
join #child_map  map      on map.old_id    = staging.id
join #parent_map parent   on parent.old_id = staging.parent_id

set identity_insert dbo.child off

--
-- reseed the identity properties high water mark
--
dbcc checkident ( 'dbo.child' , reseed )

------------------------------------
-- That's about all there is to it.
------------------------------------

I guess the trade-off you face is the performance of the BulkInsert vs the reliability of the Identity.

Can you put the database into single-user mode temporarily to perform your insert?
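
If that is acceptable, the mechanics are just a pair of ALTER DATABASE calls around the load (a sketch; "MyDatabase" is a placeholder, and the bulk load must run on this same connection, since single-user mode allows only the one connection you are holding):

using System.Data.SqlClient;

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    new SqlCommand("alter database [MyDatabase] set single_user with rollback immediate", conn)
        .ExecuteNonQuery();
    try
    {
        // ... run the bulk load here, e.g. via new SqlBulkCopy(conn) ...
    }
    finally
    {
        new SqlCommand("alter database [MyDatabase] set multi_user", conn).ExecuteNonQuery();
    }
}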

I faced a very similar issue with my conversion project, where I am adding an Identity column to very large tables that have children. Fortunately I was able to set up the identity in the parent and child sources (I used a TextDataReader) to perform the BulkInsert, and I generated the parent and child files at the same time.

I also saw the kind of performance gains you are talking about: OleDbDataReader source -> StreamWriter ... and then TextDataReader -> SqlBulkCopy.
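
The streaming step at the end can be as simple as handing any IDataReader to SqlBulkCopy (a sketch of that final hop only; it pipes an OleDbDataReader straight in rather than going through the intermediate text file and custom TextDataReader described above, and the connection strings, query, and table names are placeholders):

using System.Data.OleDb;
using System.Data.SqlClient;

using (var src = new OleDbConnection(sourceConnectionString))
using (var cmd = new OleDbCommand("select id, data, parent_id from child", src))
{
    src.Open();
    using (var reader = cmd.ExecuteReader())
    using (var bulk = new SqlBulkCopy(destinationConnectionString))
    {
        bulk.DestinationTableName = "stage.child";
        bulk.BulkCopyTimeout = 0;
        bulk.BatchSize = 10000;
        bulk.WriteToServer(reader);   // streams rows; never materializes the whole set in memory
    }
}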
