What is the fastest way to download rows from two related tables in packs?

I have some problems with two massive related tables. The first one has about 100 columns, the second about 300. The foreign key spans 5 columns. 100 million rows is nothing special in these tables.

My task is to download all the rows, process them, and then upload them. I need to download these tables in packs (10,000 rows from the parent table plus all the rows related to them from the second table).

What would be the fastest way to do it?

The simplest solution would be to download 10,000 rows from the parent table and then iterate through them to download the related rows. Simple, but I don't think it will be fast.
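For illustration, one way to avoid a round trip per parent row is to fetch the children of a whole pack in a single query and match them to their parents in memory. The sketch below only shows the matching step and makes several assumptions: the `ParentRow` and `ChildRow` shapes and the `PackMatcher` helper are hypothetical, a 2-column key stands in for the real 5-column foreign key, and the two input sequences are assumed to have been loaded already. It needs C# 9 or later for records.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shapes; a 2-column key stands in for the real 5-column foreign key.
public record ParentRow(int K1, int K2, string Name);
public record ChildRow(int K1, int K2, string Value);

public static class PackMatcher
{
    // Pairs each parent in the pack with its children through one in-memory lookup,
    // so the children can come from a single query per pack instead of one per parent.
    public static List<(ParentRow Parent, List<ChildRow> Children)> Match(
        IEnumerable<ParentRow> pack, IEnumerable<ChildRow> childrenOfPack)
    {
        // Value tuples compare by value, so they can serve as composite keys.
        var byKey = childrenOfPack.ToLookup(c => (c.K1, c.K2));

        // A lookup returns an empty sequence for a key it does not contain,
        // so parents without children simply get an empty list.
        return pack
            .Select(p => (Parent: p, Children: byKey[(p.K1, p.K2)].ToList()))
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        var pack = new[] { new ParentRow(1, 1, "a"), new ParentRow(1, 2, "b") };
        var children = new[] { new ChildRow(1, 1, "x"), new ChildRow(1, 1, "y") };

        foreach (var (parent, kids) in PackMatcher.Match(pack, children))
            Console.WriteLine($"{parent.Name}: {kids.Count} children");
    }
}
```

With this shape each pack costs two queries (one for the parents, one for their children) rather than 10,001, at the price of holding one pack's children in memory.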

Another solution could be to download both tables with a join. The problem is that I would then have to split each row into two parts, eliminate duplicates, etc. I also don't really know how fast this download would be.
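The splitting and de-duplication step can be sketched as follows. This is a minimal illustration under assumptions, not a measured solution: `JoinedRow`, `Parent`, `Child` and `JoinSplitter` are hypothetical names, a 2-column key stands in for the 5-column one, and the joined rows are assumed to arrive ordered by the parent key (for example via an ORDER BY on the key columns). It needs C# 9 or later for records.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical shapes; the real tables have about 100 and 300 columns.
public record JoinedRow(int K1, int K2, string ParentName, int? ChildId, string? ChildValue);
public record Child(int Id, string Value);
public record Parent(int K1, int K2, string Name, List<Child> Children);

public static class JoinSplitter
{
    // Assumes the rows are ordered by the parent key, so a parent is complete
    // as soon as the key changes and only one parent is held at a time.
    public static IEnumerable<Parent> Split(IEnumerable<JoinedRow> rows)
    {
        Parent? current = null;
        foreach (var row in rows)
        {
            if (current is null || current.K1 != row.K1 || current.K2 != row.K2)
            {
                if (current is not null) yield return current;
                // The parent columns repeat on every joined row; keep them once.
                current = new Parent(row.K1, row.K2, row.ParentName, new List<Child>());
            }

            // A LEFT JOIN yields null child columns for parents without children.
            if (row.ChildId is int id)
                current.Children.Add(new Child(id, row.ChildValue ?? ""));
        }
        if (current is not null) yield return current;
    }
}

public static class Program
{
    public static void Main()
    {
        var rows = new[]
        {
            new JoinedRow(1, 1, "a", 10, "x"),
            new JoinedRow(1, 1, "a", 11, "y"),
            new JoinedRow(1, 2, "b", null, null),
        };

        foreach (var p in JoinSplitter.Split(rows))
            Console.WriteLine($"{p.K1},{p.K2} {p.Name}: {p.Children.Count} children");
    }
}
```

Because the method streams, memory use stays at one parent plus its children, but the join still transfers every parent column once per child row, which matters with roughly 100 parent columns.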

So, my question is the same as in the title: what is the fastest way to download a massive amount of data from related tables in packs?

I think the best solution here is to first download all the rows you require, so all 1 million rows, and put them into a List<Type>, where Type is the actual type of the table. This is easily done using a framework like NHibernate, where you can map the database structure to classes.

Once you have that, you can proceed with something like this. You have a number of batches, let's say 10,000 rows per batch.

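    // Requires: using System; using System.Linq;
    // LIST is the List<Type> holding all downloaded rows.
    int totalCount = LIST.Count;
    int batchSize = 10000;
    // Round up so that a final, partial batch is still processed.
    int numberOfBatches = (int)Math.Ceiling((decimal)totalCount / batchSize);

    for (int i = 0; i < numberOfBatches; i++)
    {
        var currentBatch = LIST.Skip(i * batchSize).Take(batchSize);

        // CONTENT HERE.
    }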

So basically you will be updating x rows at a time in the database. I highly suggest you use NHibernate as opposed to SQLReader/Writers, as it is much more efficient and tasks like insert/update/delete become trivial.

EDIT: As an alternative to an NHibernate update, you can use a bulk update. Have a look at Bulk Update in C#.

The fastest way would be to use an ETL tool like SSIS to process the data on the server without transferring it to other machines.

SSIS allows batching, per-row processing of data streams with many built-in operations or even C# scripts, execution monitoring, handling of dirty data, etc.

In ETL scenarios IO is the big killer, so transferring the data to other machines should be avoided. Connection latencies are another killer, so retrieving child records from a client machine is also going to kill performance.

A proper SQL statement, like a join between the two tables, will perform orders of magnitude better than pulling the data to some other machine and then pushing it back. Moreover, the database can optimize large selects and updates, because it can choose the proper query and update strategies and reads only the data that the selects or updates actually use.

Finally, ORMs like NHibernate, EF or Linq to SQL should be avoided at all costs in ETL scenarios with even modest data sizes. The CPU and memory overhead of mapping thousands (let alone millions) of objects is significant and provides no benefit.
