Bulk insert strategy from C# to SQL Server
In our current project, customers will send a collection of complex/nested messages to our system. The frequency of these messages is approx. 1000-2000 msg/sec.
These complex objects contain the transaction data (to be added) as well as master data (which will be added if not found). But instead of passing the ids of the master data, the customer passes the 'Name' column.

The system checks whether master data exists for these names. If found, it uses the ids from the database; otherwise it creates the master data first and then uses those ids.

Once the master data ids are resolved, the system inserts the transactional data into a SQL Server database (using the master data ids). The number of master entities per message is around 15-20.
Following are some strategies we can adopt:
Option 1: We can resolve the master ids first from our C# code (inserting master data where it is not found) and store these ids in a C# cache. Once all ids are resolved, we can bulk insert the transactional data using the SqlBulkCopy class. We can hit the database 15 times to fetch the ids for the different entities, and then hit the database one more time to insert the final data. We can use the same connection and close it after doing all this processing.
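A minimal sketch of this first strategy (all table, column, and type names here are hypothetical; only one of the 15-20 master entities is shown, and the check-then-insert is not safe under concurrent senders without extra locking):

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public static class Strategy1
{
    // Hypothetical app-layer cache: master-data Name -> Id.
    static readonly Dictionary<string, int> categoryIds = new Dictionary<string, int>();

    // Resolve a master-data id, inserting the row if the Name is unknown.
    public static int ResolveCategoryId(SqlConnection conn, string name)
    {
        if (categoryIds.TryGetValue(name, out int id))
            return id;

        using (var cmd = new SqlCommand(
            "SELECT CategoryId FROM dbo.Categories WHERE Name = @Name;", conn))
        {
            cmd.Parameters.AddWithValue("@Name", name);
            object found = cmd.ExecuteScalar();
            if (found == null)
            {
                // Not found: insert it and capture the new id.
                cmd.CommandText =
                    "INSERT INTO dbo.Categories (Name) OUTPUT INSERTED.CategoryId " +
                    "VALUES (@Name);";
                found = cmd.ExecuteScalar();
            }
            id = (int)found;
        }
        categoryIds[name] = id;
        return id;
    }

    // Bulk insert the transactional rows once all master ids are resolved.
    public static void BulkInsertPrices(SqlConnection conn, DataTable prices)
    {
        using (var bulk = new SqlBulkCopy(conn))
        {
            bulk.DestinationTableName = "dbo.VendorPrices";
            bulk.WriteToServer(prices);
        }
    }
}
```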
Option 2: We can send all these messages containing master data and transactional data to the database in a single hit (in the form of multiple TVPs) and then, inside the stored procedure, create the master data first for the missing entries and then insert the transactional data.
Could anyone suggest the best approach for this use case?
Due to some privacy issues, I cannot share the actual object structure, but here is a hypothetical object structure which is very close to our business object.

One such message will contain information about one product (its master data) and its price details (transaction data) from different vendors:
Master data (which needs to be added if not found):

Product Name: ABC, ProductCategory: XYZ, Manufacturer: XXX, and some other details (the number of properties is in the range of 15-20).

Transaction data (which will always be added):

Vendor Name: A, ListPrice: XXX, Discount: XXX
Vendor Name: B, ListPrice: XXX, Discount: XXX
Vendor Name: C, ListPrice: XXX, Discount: XXX
Vendor Name: D, ListPrice: XXX, Discount: XXX
Most of the information about the master data will remain the same for messages belonging to one product (and will change less frequently), but the transaction data will always fluctuate. So, the system will check whether the product 'XXX' exists in the system or not. If not, it checks whether the 'Category' mentioned with this product exists or not. If not, it will insert a new record for the category and then one for the product. The same is done for the Manufacturer and the other master data.
Multiple vendors will be sending data about multiple products (2000-5000) at the same time. So, assume that we have 1000 suppliers, and each vendor is sending data about 10-15 different products. Every 2-3 seconds, each vendor sends us the price updates for these 10 products. A vendor may also start sending data about new products, but that will not be very frequent.
You would likely be best off with your #2 idea (i.e. sending all of the 15-20 entities to the DB in one shot using multiple TVPs, and processing the whole set of up to 2000 messages at once).
Caching master data lookups at the app layer and translating them prior to sending to the DB sounds great, but it misses something: why duplicate at the app layer what is already provided, and already happening, at the DB layer? The lookup tables are essentially just Name and ID pairs, which can pack many rows into a single data page when using a 100% Fill Factor. Hence, you don't need to worry about aging out old entries OR forcing any key expirations or reloads due to possibly changing values (i.e. an updated Name for a particular ID), as that is handled naturally.
Yes, in-memory caching is wonderful technology and greatly speeds up websites, but those scenarios / use-cases are for non-database processes requesting the same data over and over for purely read-only purposes. This particular scenario, however, is one in which data is being merged and the list of lookup values can change frequently (more so due to new entries than due to updated entries).
That all being said, Option #2 is the way to go. I have done this technique several times with much success, though not with 15 TVPs. It might be that some optimizations / adjustments need to be made to tune this particular situation, but what I have found to work well is:
Pass the data in via TVPs. I prefer this over SqlBulkCopy because it lets you fully stream the collection(s) to the DB without first copying them into a DataTable, which duplicates the collection and wastes CPU and memory. This requires that you create a method per collection that returns IEnumerable<SqlDataRecord>, accepts the collection as input, and uses yield return; to send each record in the for or foreach loop.

TVPs are not great for statistics, and hence not great to JOIN to (though this can be mitigated by using TOP (@RecordCount) in the queries), but you don't need to worry about that anyway, since they are only used to populate the real tables with any missing values.
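As a hedged sketch, such a streaming method might look like this (the VendorPrice class and the TVP column names are hypothetical):

```csharp
using System.Collections.Generic;
using System.Data;
using Microsoft.SqlServer.Server;

public class VendorPrice
{
    public string VendorName;
    public decimal ListPrice;
    public decimal Discount;
}

public static class TvpSender
{
    // Streams the collection as TVP rows without copying it into a DataTable.
    // The single SqlDataRecord is reused for each row, so nothing is buffered.
    public static IEnumerable<SqlDataRecord> GetPriceRecords(IEnumerable<VendorPrice> prices)
    {
        var record = new SqlDataRecord(
            new SqlMetaData("VendorName", SqlDbType.NVarChar, 100),
            new SqlMetaData("ListPrice", SqlDbType.Decimal, 18, 4),
            new SqlMetaData("Discount", SqlDbType.Decimal, 18, 4));

        foreach (var p in prices)
        {
            record.SetString(0, p.VendorName);
            record.SetDecimal(1, p.ListPrice);
            record.SetDecimal(2, p.Discount);
            yield return record; // each row is sent as it is produced
        }
    }
}
```

Note that ADO.NET throws if an IEnumerable<SqlDataRecord> TVP value yields zero rows, so an empty collection should be sent as a null/omitted parameter instead.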
Step 1: Insert missing Names for each entity. Remember that there should be a NonClustered Index on the [Name] field of each entity table; assuming the ID is the Clustered Index, that value will naturally be part of the index, hence [Name] alone will provide a covering index in addition to helping the following operation. Also remember that any prior executions for this client (i.e. roughly the same entity values) will cause the data pages for these indexes to remain cached in the Buffer Pool (i.e. memory).
```sql
;WITH cte AS
(
    SELECT DISTINCT tmp.[Name]
    FROM   @EntityNumeroUno tmp
)
INSERT INTO EntityNumeroUno ([Name])
    SELECT  cte.[Name]
    FROM    cte
    WHERE   NOT EXISTS (
                SELECT  *
                FROM    EntityNumeroUno tab
                WHERE   tab.[Name] = cte.[Name]
            );
```
Step 2: Insert all of the "messages" via a simple INSERT...SELECT, where the data pages for the lookup tables (i.e. the "entities") are already cached in the Buffer Pool thanks to Step 1.
Finally, keep in mind that conjecture / assumptions / educated guesses are no substitute for testing. You need to try a few methods to see what works best for your particular situation, since there might be additional details that have not been shared here that could influence what is considered "ideal".
I will say that if the Messages are insert-only, then Vlad's idea might be faster. The method I am describing here I have used in situations that were more complex, required full syncing (updates and deletes), and did additional validations and creation of related operational data (not lookup values). Using SqlBulkCopy might be faster on straight inserts (though for only 2000 records I doubt there is much difference, if any at all), but this assumes you are loading directly into the destination tables (messages and lookups) and not into intermediary / staging tables (and I believe Vlad's idea is to SqlBulkCopy directly into the destination tables). However, as stated above, using an external cache (i.e. not the Buffer Pool) is also more error-prone due to the issue of updating lookup values. It could take more code than it's worth to account for invalidating an external cache, especially if the external cache is only marginally faster. That additional risk / maintenance needs to be factored into deciding which method is overall better for your needs.
UPDATE
Based on info provided in the comments, we now know more about the scenario.
With all of this in mind, I will still recommend TVPs, but re-think the approach and make it Vendor-centric, not Product-centric. The assumption here is that Vendors send files whenever; so when you get a file, import it. The only lookup you would be doing ahead of time is the Vendor. Here is the basic layout:
Create a SendRows method that:

- accepts a FileStream (or whatever lets you advance through the file)
- accepts something like an int BatchSize
- returns IEnumerable<SqlDataRecord>
- creates a SqlDataRecord to match the TVP structure
- maps each row of the file onto that SqlDataRecord and issues yield return;, until either the batch size is reached or the file is exhausted

Then, while there is still data in the file, call the stored procedure, passing in the Vendor's ID and SendRows(FileStream, BatchSize) for the TVP.
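A rough, hypothetical sketch of such a batched SendRows method (the file format, field layout, and TVP columns are assumptions; a StreamReader over the file stream is used here so the read position survives across batches):

```csharp
using System.Collections.Generic;
using System.Data;
using System.IO;
using Microsoft.SqlServer.Server;

public static class VendorImport
{
    // Streams up to batchSize rows from the open reader as TVP rows.
    // Call repeatedly with the same reader until it is exhausted.
    public static IEnumerable<SqlDataRecord> SendRows(StreamReader reader, int batchSize)
    {
        var record = new SqlDataRecord(
            new SqlMetaData("SKU", SqlDbType.VarChar, 50),
            new SqlMetaData("ListPrice", SqlDbType.Decimal, 18, 4));

        int sent = 0;
        string line;
        while (sent < batchSize && (line = reader.ReadLine()) != null)
        {
            // Assumed line format: SKU,ListPrice (validate / map as needed)
            string[] fields = line.Split(',');
            record.SetString(0, fields[0]);
            record.SetDecimal(1, decimal.Parse(fields[1]));
            yield return record;
            sent++;
        }
    }
}
```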
Using this type of structure, you will be sending in Product properties that are not used (i.e. only the SKU is used to look up existing Products). BUT, it scales very well, as there is no upper bound on file size. If the Vendor sends 50 Products, fine. If they send 50k Products, fine. If they send 4 million Products (which is the system I worked on, and it did handle updating Product info that was different for any of its properties!), then fine. No increase in memory at the app layer or DB layer is needed to handle even 10 million Products, and the time the import takes should increase in step with the number of Products sent.
UPDATE 2
New details related to the Source data: if the data source is C# objects, then I would most definitely use TVPs, as you can send them over as-is via the method I described in my first update (i.e. a method that returns IEnumerable<SqlDataRecord>). Send one or more TVPs for the Price/Offer-per-Vendor details, but regular input parameters for the singular Property attributes. For example:
```sql
CREATE PROCEDURE dbo.ImportProduct
(
    @SKU             VARCHAR(50),
    @ProductName     NVARCHAR(100),
    @Manufacturer    NVARCHAR(100),
    @Category        NVARCHAR(300),
    @VendorPrices    dbo.VendorPrices READONLY,
    @DiscountCoupons dbo.DiscountCoupons READONLY
)
AS
SET NOCOUNT ON;

-- Insert Product if it doesn't already exist
IF (NOT EXISTS(
        SELECT  *
        FROM    dbo.Products pr
        WHERE   pr.SKU = @SKU
       )
   )
BEGIN
    INSERT INTO dbo.Products (SKU, ProductName, Manufacturer, Category, ...)
    VALUES (@SKU, @ProductName, @Manufacturer, @Category, ...);
END;

-- ...INSERT data from TVPs
-- might need OPTION (RECOMPILE) per each TVP query to ensure proper estimated rows
```
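Calling a procedure shaped like this from C# might look roughly as follows (the Product class and the dbo.VendorPrices TVP type name are hypothetical; the TVP value can be a DataTable or, better, a method returning IEnumerable<SqlDataRecord> as described earlier in this answer):

```csharp
using System.Data;
using System.Data.SqlClient;

public class Product
{
    public string Sku, Name, Manufacturer, Category;
    public DataTable VendorPriceRows; // or an IEnumerable<SqlDataRecord> source
}

public static class ImportCaller
{
    public static void ImportProduct(string connectionString, Product product)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("dbo.ImportProduct", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;

            // Singular Product attributes as regular scalar parameters
            cmd.Parameters.Add("@SKU", SqlDbType.VarChar, 50).Value = product.Sku;
            cmd.Parameters.Add("@ProductName", SqlDbType.NVarChar, 100).Value = product.Name;
            cmd.Parameters.Add("@Manufacturer", SqlDbType.NVarChar, 100).Value = product.Manufacturer;
            cmd.Parameters.Add("@Category", SqlDbType.NVarChar, 300).Value = product.Category;

            // Per-Vendor details as a TVP
            var tvp = cmd.Parameters.AddWithValue("@VendorPrices", product.VendorPriceRows);
            tvp.SqlDbType = SqlDbType.Structured;
            tvp.TypeName = "dbo.VendorPrices";

            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```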
From a DB point of view, there's nothing as fast as BULK INSERT (from csv files, for example). The best approach is to bulk-load all the data as soon as possible, then process it with stored procedures.

A C# layer will just slow down the process, since all the round trips between C# and SQL will be thousands of times slower than what SQL Server can handle directly.