
How to merge a 500 million row table with another 500 million row table

I have to merge two tables of 500M+ rows each.

What is the best method to merge them?

I just need to display the records from these two SQL Server tables when somebody searches on my webpage.

These are fixed tables; no one will ever change the data in them once they are live.

create a view myview as select * from table1 union select * from table2 

Is there any harm in using the above method?

If I start merging 500M rows it will run for days, and if the machine reboots, the database will go into recovery mode and I will have to start again from the beginning.

Why am I merging these tables?

  • I have a website which provides a search on the person table.
  • This table has columns like Name, Address, Age etc.
  • We received similar .txt files with about 500 million records, which we loaded into another table.
  • Now we want the website search page to query both tables to see if a person exists in either of them.
  • We keep getting similar .txt files of 100 million or 20 million records, which we load into this huge table.

How are we currently doing it?

  • We import the .txt files into separate tables (some columns are different in the .txt files)
  • Then we arrange the columns and do the data type conversions
  • Then we insert this staging table into the huge liveCopy table (in the test environment)

We have SQL Server 2008 R2.

  • Can we use table partitioning for performance benefits?
  • Is it OK to create small monthly tables and create a view on top of them?
  • How can indexing be done in this case?

We load new data only once a month and then run SELECT queries.

Does replication help?

The biggest issue I am facing is managing huge tables.

I hope I have explained the situation.

Thanks & Regards

1) To improve performance, developers often split large tables into smaller ones and call this partitioning (horizontal partitioning, to be precise, because there is also vertical partitioning). Your view is an example of such partitions combined again. Of course, partitioning is mostly used to split a large amount of data into ranges of values (for example, table1 contains the records with [col1] < 0, while table2 contains those with [col1] >= 0). But it is fine for unsorted data too, because it leaves you more room for speed improvements, for example parallel reads if you put the tables on different storage. So this is a good choice.
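As a rough sketch of what such a range-split view could look like (the table names, the columns and the boundary value are made up for illustration, they are not taken from the question):

```sql
-- Hypothetical schema: dbo.Person_A / dbo.Person_B, PersonId and the
-- boundary 500000000 are assumptions made for this example only.
CREATE TABLE dbo.Person_A (
    PersonId INT NOT NULL
        CONSTRAINT CK_Person_A_Range CHECK (PersonId < 500000000),
    Name     NVARCHAR(200) NOT NULL,
    CONSTRAINT PK_Person_A PRIMARY KEY CLUSTERED (PersonId)
);

CREATE TABLE dbo.Person_B (
    PersonId INT NOT NULL
        CONSTRAINT CK_Person_B_Range CHECK (PersonId >= 500000000),
    Name     NVARCHAR(200) NOT NULL,
    CONSTRAINT PK_Person_B PRIMARY KEY CLUSTERED (PersonId)
);
GO

-- UNION ALL simply concatenates the two tables.
CREATE VIEW dbo.PersonAll
AS
SELECT PersonId, Name FROM dbo.Person_A
UNION ALL
SELECT PersonId, Name FROM dbo.Person_B;
GO
```

With the CHECK constraints in place, the optimizer can skip the table whose range cannot match a filter on PersonId. Note the UNION ALL rather than UNION: a plain UNION also removes duplicate rows, which means extra sort or hash work across both tables on every query.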

2) Another way is to use the MERGE statement, supported in SQL Server 2008 and higher: http://msdn.microsoft.com/en-us/library/bb510625(v=sql.100).aspx
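A minimal sketch of such a MERGE, assuming a hypothetical target table dbo.Person, a staging table dbo.PersonStaging and a PersonId key (none of these names come from the question):

```sql
-- Insert every staging row whose key is not yet present in the target.
-- Table and column names are assumptions for illustration.
MERGE dbo.Person AS target
USING dbo.PersonStaging AS source
    ON target.PersonId = source.PersonId
WHEN NOT MATCHED BY TARGET THEN
    INSERT (PersonId, Name, Address, Age)
    VALUES (source.PersonId, source.Name, source.Address, source.Age);
```

Run as a single statement over 500M rows this is still one huge transaction, so it only makes sense together with the batching described in the next point.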

3) Of course you can copy using INSERT+DELETE, but in this case, or if you use the MERGE command, do it in small batches. Something like:

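SET ROWCOUNT 10000            -- limit each following DML statement to 10,000 rows
DECLARE @Count [int] = 1
WHILE @Count > 0 BEGIN
    -- ... INSERT+DELETE/MERGE transaction ...

    SET @Count = @@ROWCOUNT   -- read this right after the DML statement; 0 ends the loop
END
SET ROWCOUNT 0                -- restore the default (no row limit)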

If your purpose is truly just to move the data from the two tables into one table, you will want to do it in batches of 100K records at a time, or something like that. I'd guess you crashed before because your transaction log filled up, although that's just speculation. Make sure to throw in a CHECKPOINT after each batch if you are in the Simple recovery model; under the Full recovery model the log is only truncated by log backups, so take those between batches instead.
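One way to sketch such a batched move is DELETE with an OUTPUT ... INTO clause, which removes a batch from the source and inserts it into the target in a single statement. The table and column names below are hypothetical, and it assumes the target has no enabled triggers, foreign keys or CHECK constraints, since OUTPUT INTO does not allow those on its target table:

```sql
-- Moves rows from dbo.PersonStaging to dbo.Person in batches.
-- All object names and the batch size are assumptions for illustration.
DECLARE @BatchSize INT = 100000;
DECLARE @Rows      INT = 1;

WHILE @Rows > 0
BEGIN
    -- One statement = one short transaction: delete a batch from the
    -- source and write the deleted rows into the target.
    DELETE TOP (@BatchSize)
    FROM dbo.PersonStaging
    OUTPUT deleted.Name, deleted.Address, deleted.Age
    INTO dbo.Person (Name, Address, Age);

    SET @Rows = @@ROWCOUNT;   -- 0 once the staging table is empty

    CHECKPOINT;               -- lets the log be reused under Simple recovery
END
```

Because rows that were already moved are gone from the staging table, the loop can simply be started again after a reboot and it carries on where it stopped.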

That said, I agree with all the comments asking you to explain why you are doing this; it may not be necessary at all.

You may want to have a look at an indexed view.
This way you can set up indexes on your view and get the best performance out of it. The expensive part of using indexed views is in the CRUD operations, but for read performance it would be your best solution.

http://www.brentozar.com/archive/2013/11/what-you-can-and-cant-do-with-indexed-views/

https://www.simple-talk.com/sql/learn-sql-server/sql-server-indexed-views-the-basics/
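For reference, a minimal indexed view might look like the sketch below. The object names are hypothetical, and note that an indexed view cannot contain UNION, so it would have to sit on a single table rather than on the UNION view from the question:

```sql
-- Hypothetical names; assumes the session SET options required for
-- indexed views are on, as they are by default in SSMS.
CREATE VIEW dbo.PersonSearch
WITH SCHEMABINDING            -- required before the view can be indexed
AS
SELECT PersonId, Name, Address, Age
FROM dbo.Person;              -- two-part name is mandatory with SCHEMABINDING
GO

-- The first index on a view must be unique and clustered.
CREATE UNIQUE CLUSTERED INDEX IX_PersonSearch_PersonId
    ON dbo.PersonSearch (PersonId);
GO

-- Further nonclustered indexes can then support the search columns.
CREATE NONCLUSTERED INDEX IX_PersonSearch_Name
    ON dbo.PersonSearch (Name);
GO
```

On editions other than Enterprise, queries generally need the WITH (NOEXPAND) table hint on the view for these indexes to be used.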

If the two tables are linked one to one, then you are wasting a lot of CPU time on each read, especially since you mentioned that the tables don't change at all. You should have only one table in this case. Try creating a new table that includes (at least) the two columns from the two tables. You can do this with:

SELECT A.x, B.y INTO newTable
  FROM A LEFT JOIN B ON A.x = B.y

which keeps every row of A (with NULLs where B has no match), or, to keep only the people that have matching information from the text file:

SELECT A.x, B.y INTO newTable
  FROM A INNER JOIN B ON A.x = B.y

Note that you should have an index on the join columns at least (to speed up the process).

More details about the fields would help in giving a more precise answer.
