
Moving data from production db to datawarehouse (SQL Server)

We are developing a reporting module for our software, and because of this we need to move some data from the system's production db into a datawarehouse db which will be used as the datasource for the reports (SQL Server reporting).

The schema in the production DB is quite old, so once we have data in the DW DB, we will need some additional fields, for example a correct datetime column calculated from the prod db's 'date' and 'time' integer columns. (Don't ask, it's old.)
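To make the 'date' and 'time' transformation concrete, here is a minimal sketch in Python. The post does not say how the integers are encoded, so the YYYYMMDD and HHMMSS layouts below are assumptions, and `combine_date_time` is a hypothetical helper name.

```python
from datetime import datetime


def combine_date_time(date_int: int, time_int: int) -> datetime:
    """Build a datetime from separate integer date and time columns.

    ASSUMPTION: date_int is encoded as YYYYMMDD and time_int as HHMMSS
    (leading zeros dropped, so 93005 means 09:30:05). The real legacy
    encoding is not given in the post and may differ.
    """
    year, rest = divmod(date_int, 10000)
    month, day = divmod(rest, 100)
    hour, rest = divmod(time_int, 10000)
    minute, second = divmod(rest, 100)
    # datetime() raises ValueError for impossible values such as month 13,
    # which is a cheap way to surface bad legacy rows during the load.
    return datetime(year, month, day, hour, minute, second)
```

The same arithmetic could live in a view or in a derived column in the load job; the point is only that the conversion is a pure function of the two source columns.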

We are discussing internally how to do this in an efficient manner. Right now, it is implemented in a fugly SSIS job that basically tears down the entire DW DB every night and builds it up again from the prod db, doing data transformations as it goes. This doesn't scale very well.

I've been looking into using "newer" technologies, such as SQL Server replication, to move data in a more granular fashion.

My question about this is: with replication, the "move data" part is obviously solved, but not the data transform part. I know I can create update triggers on the DW DB, but all table-related triggers seem to be wiped whenever I reinitialize the subscription, which makes this hard to set up.

I'm not looking for an exact answer here, more a hint on which direction to take this. Sorry if the question is a bit blurry.

Update: thanks for the good points below. This is software we're selling to customers, so I'm a big fan of having as few "config items" as possible for the customer to set up and maintain. The SSIS package as it stands today is one more "item" for the customer to keep tabs on, along with its schedules.

Replication intrigued me because it completely abstracts the whole CRUD "dilemma" when moving data, but you may be right: SSIS would still be better, as long as the SSIS logic is built a bit smarter than it is today.

The data might be quite large though, so wiping and reimporting everything like we do today is definitely a problem that needs addressing.

I don't think replication is a good idea. It would be if the source and destination schemas were exactly the same, but as you pointed out, they are not. Also, all the calculations you mention the SSIS job is doing would still have to be done somewhere, because replication won't do them.

I think SSIS is the way to go; this is exactly why it exists.

Since you are recreating the DB on each load, and if the amount of calculations and changes is not big and you don't need to do things like lookups to get surrogate keys from natural keys, you could create views on your main database that mimic the structure of the destination database, so you can do direct inserts (pretty much a source component mapped to a destination component).
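A rough sketch of the view idea, using Python's built-in sqlite3 as a stand-in so it can run anywhere. Everything here is hypothetical: the table and column names (`prod_orders`, `dw_orders`, `v_dw_orders`) are made up, and on SQL Server the view would be written in T-SQL and the insert would be done by the load job.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a made-up source and destination table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE prod_orders (ord_no INTEGER, cust TEXT, qty INTEGER, price REAL);
        CREATE TABLE dw_orders (order_id INTEGER, customer_name TEXT, total_amount REAL);
        INSERT INTO prod_orders VALUES (1, 'Acme', 2, 10.0), (2, 'Globex', 1, 5.5);
    """)
    return conn


def load_via_view(conn: sqlite3.Connection) -> int:
    """Reload the destination table through a view shaped like it.

    The view does the renaming and the calculation, so the load itself
    is a plain column-to-column insert. Returns the number of rows loaded.
    """
    conn.executescript("""
        CREATE VIEW IF NOT EXISTS v_dw_orders AS
        SELECT ord_no      AS order_id,
               cust        AS customer_name,
               qty * price AS total_amount
        FROM prod_orders;

        DELETE FROM dw_orders;
        INSERT INTO dw_orders (order_id, customer_name, total_amount)
        SELECT order_id, customer_name, total_amount FROM v_dw_orders;
    """)
    return conn.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0]
```

Calling `load_via_view(make_demo_db())` fills `dw_orders` from the view. This is still a full reload, so it only addresses the transformation part, not the data volume.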

Maybe if you specify what the real issue with SSIS is that you want to solve, it would be easier to help.

Just a quick update on this: the Change Data Capture (CDC) functionality of SQL Server seems to be what we need to look into; it integrates nicely with SSIS. Thanks for the hint on Slowly Changing Dimensions, and SSIS!
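For anyone wondering what a CDC-driven load looks like in principle, here is a small Python sketch that applies change rows to a target instead of rebuilding it. The operation codes match the `__$operation` values in SQL Server's CDC change tables (1 = delete, 2 = insert, 3 = update before image, 4 = update after image). The row layout, the `transform` step and the `id`, `qty` and `price` fields are made up for illustration.

```python
# Operation codes as used in the __$operation column of CDC change tables
DELETE, INSERT, UPDATE_BEFORE, UPDATE_AFTER = 1, 2, 3, 4


def transform(row: dict) -> dict:
    """Hypothetical transformation step: add a derived 'total' field."""
    out = dict(row)
    out["total"] = row["qty"] * row["price"]
    return out


def apply_changes(target: dict, changes: list, key: str = "id") -> dict:
    """Apply CDC-style change rows, in order, to a target keyed by `key`.

    Only rows that changed are touched; the rest of the target stays as is.
    """
    for change in changes:
        op, row = change["op"], change["row"]
        if op == DELETE:
            target.pop(row[key], None)
        elif op in (INSERT, UPDATE_AFTER):
            target[row[key]] = transform(row)
        # UPDATE_BEFORE rows carry the old values and are skipped here
    return target
```

In a real setup the change rows would come from the CDC change tables and the target would be the DW tables, but the shape of the logic is the same: the transformation runs only on rows that were inserted or updated since the last load.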
