How to merge 500 million table with another 500 million table

Question

I have to merge two 500M+ row tables.

What is the best method to merge them?

I just need to display the records from these two SQL-Server tables if somebody searches on my webpage.

These are fixed tables, no one will ever change data in these tables once they are live.

create a view myview as select * from table1 union select * from table2

Is there any harm using the above method?

If I start merging 500M rows it will run for days and if machine reboots it will make the database go into recovery mode, and then I have to start from the beginning again.

Why Am I merging these table?

I have a website which provides a search on the person table.
This table have columns like Name, Address, Age etc
We got 500 million similar .txt files which we loaded into some other table.
Now we want the website search page to query both tables to see if a person exists in the table.
We get similar .txt files of 100 million or 20 million, which we load to this huge table.

How we are currently doing it?

We import the .txt files into separate tables ( some columns are different in .txt)
Then we arrange the columns and do the data type conversions
Then insert this staging table into the liveCopy huge table ( in test environment)

We have SQL server 2008 R2

Can we use table partitioning for performance benefits?
Is it ok to create monthly small tables and create a view on top of them?
How can indexing be done in this case?

We only load new data once in a month and do the select

Does replication help?

Biggest issue I am facing is managing huge tables.

I hope I explained the situation .

Thanks & Regards

Answer 1

1) Usually developers, to achieve more performance, are splitting large tables into smaller ones and call this as partitioning (horizontal to be more precise, because there is also vertical one). Your view is a sample of such partitions joined. Of course, it is mostly used to split a large amount of data into range of values (for example, table1 contains records with column [col1] < 0, while table2 with [col1] >= 0). But even for unsorted data it is ok too, because you get more room for speed improvements. For example - parallel reads if put tables to different storages. So this is a good choice.

2) Another way is to use MERGE statement supported in SQL Server 2008 and higher - http://msdn.microsoft.com/en-us/library/bb510625(v=sql.100).aspx .

3) Of course you can copy using INSERT+DELETE, but in this case or in case of MERGE command used do this in a small batches. Smth like:

SET ROWCOUNT 10000
DECLARE @Count [int] = 1
WHILE @Count > 0 BEGIN
    ... INSERT+DELETE/MERGE transcation...

    SET @Count = @@ROWCOUNT
END

Answer 2

If your purpose is truly just to move the data from the two tables into one table, you will want to do it in batches - 100K records at a time, or something like that. I'd guess you crashed before because your T-Log got full, although that's just speculation. Make sure to throw in a checkpoint after each batch if you are in Full recovery mode.

That said, I agree with all the comments that you should provide why you are doing this - it may not be necessary at all.

Answer 3

You may want to have a look at an Indexed View.
In this way, you can set up indexes on your view and get the best performance out of it. The expensive part of using Indexed Views is in the CRUD operations - but for read performance it would be your best solution.

http://www.brentozar.com/archive/2013/11/what-you-can-and-cant-do-with-indexed-views/

https://www.simple-talk.com/sql/learn-sql-server/sql-server-indexed-views-the-basics/

Answer 4

If the two tables are linked one to one, then you are wasting the cpu time a lot for each read. Especially that you mentioned that the tables don't change at all. You should have only one table in this case. Try creating a new table including (at least) the two columns from the two tables. You can do this by:

Select into newTable
  from A left join B on Ax=By 

or (if some people don't have the information of the text file)

Select into newTable
  from A inner join B on Ax=By 

And note that you have to have made index on the join fields at least (to speed up the process).

More details about the fields may help giving more precise answer as well.

How to merge 500 million table with another 500 million table

Question

4 answers

solution1
2 2014-06-09 15:12:49

solution2
1 ACCPTED 2014-06-09 15:10:03

solution3
1 2014-06-09 15:14:22

solution4
1 2014-06-09 17:28:17

How to merge 500 million table with another 500 million table

Question

4 answers

solution1 2 2014-06-09 15:12:49

solution2 1 ACCPTED 2014-06-09 15:10:03

solution3 1 2014-06-09 15:14:22

solution4 1 2014-06-09 17:28:17

solution1
2 2014-06-09 15:12:49

solution2
1 ACCPTED 2014-06-09 15:10:03

solution3
1 2014-06-09 15:14:22

solution4
1 2014-06-09 17:28:17