
Bulk insert strategy from C# to SQL Server

In our current project, customers will send collections of complex/nested messages to our system. The frequency of these messages is approximately 1000-2000 messages per second.

These complex objects contain the transaction data (to be added) as well as master data (which will be added if not found). However, instead of passing the IDs of the master data, the customer passes the 'name' column.

The system checks whether master data exists for these names. If found, it uses the IDs from the database; otherwise it creates the master data first and then uses those IDs.

Once the master data IDs are resolved, the system inserts the transactional data into a SQL Server database (using the master data IDs). The number of master entities per message is around 15-20.

Following are some strategies we can adopt:

  1. We can resolve the master IDs first in our C# code (inserting the master data where it is not found) and store these IDs in a C# cache. Once all of the IDs are resolved, we can bulk insert the transactional data using the SqlBulkCopy class. We would hit the database 15 times to fetch the IDs for the different entities, and then hit the database one more time to insert the final data. We can use the same connection and close it after doing all of this processing. (A minimal sketch of this approach appears after this list.)

  2. We can send all of these messages, containing both master data and transactional data, to the database in a single hit (in the form of multiple TVPs), and then, inside a stored procedure, create the master data first for the missing entries and then insert the transactional data.
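
A rough sketch of approach #1, under stated assumptions: the table names (dbo.Products, dbo.VendorPrices), column names, and the idea of one lookup method per master entity are all hypothetical, since the real schema cannot be shared. The pattern is: resolve each master-data ID by name (inserting the row if it is missing), then push the transactional rows with a single SqlBulkCopy call.

using System.Data;
using System.Data.SqlClient;

public static class Approach1Importer
{
    // Resolve one master-data ID by name, inserting the row if it does not exist yet.
    // In the real system this would be repeated for each of the 15-20 master entities.
    public static int ResolveProductId(SqlConnection connection, string productName)
    {
        using (var lookup = new SqlCommand(
            "SELECT ProductId FROM dbo.Products WHERE ProductName = @Name;", connection))
        {
            lookup.Parameters.AddWithValue("@Name", productName);
            object id = lookup.ExecuteScalar();
            if (id != null)
                return (int)id;
        }

        using (var insert = new SqlCommand(
            "INSERT INTO dbo.Products (ProductName) OUTPUT INSERTED.ProductId VALUES (@Name);", connection))
        {
            insert.Parameters.AddWithValue("@Name", productName);
            return (int)insert.ExecuteScalar();
        }
    }

    // Bulk insert the transactional rows once every master-data ID has been resolved.
    public static void BulkInsertPrices(SqlConnection connection, DataTable priceRows)
    {
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "dbo.VendorPrices";   // hypothetical destination table
            bulkCopy.BatchSize = 2000;
            bulkCopy.WriteToServer(priceRows);                    // one final hit for the transaction data
        }
    }
}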

Could anyone suggest the best approach in this use case?

Due to some privacy issues, I cannot share the actual object structure. But here is a hypothetical object structure which is very close to our business object.

One such message will contain information about one product (its master data) and its price details (transaction data) from different vendors:

Master data (which needs to be added if not found)

Product name: ABC, ProductCategory: XYZ, Manufacturer: XXX, and some other details (the number of properties is in the range of 15-20).

Transaction data (which will always be added)

Vendor Name: A, ListPrice: XXX, Discount: XXX

Vendor Name: B, ListPrice: XXX, Discount: XXX

Vendor Name: C, ListPrice: XXX, Discount: XXX

Vendor Name: D, ListPrice: XXX, Discount: XXX

Most of the information about the master data will remain the same for messages belonging to one product (and will change less frequently), but the transaction data will always fluctuate. So, the system will check whether the product 'XXX' exists in the system or not. If not, it checks whether the 'Category' mentioned with this product exists. If not, it will insert a new record for the category and then for the product. The same will be done for Manufacturer and the other master data.

Multiple vendors will be sending data about multiple products (2000-5000) at the same time.

So, assume that we have 1000 vendors, and each vendor is sending data about 10-15 different products. Every 2-3 seconds, each vendor sends us price updates for these products. A vendor may also start sending data about new products, but that will not be very frequent.

You would likely be best off with your #2 idea (i.e. sending all of the 15-20 entities to the DB in one shot using multiple TVPs, and processing the whole set of up to 2000 messages at once).

Caching the master data lookups at the app layer and translating prior to sending to the DB sounds great, but it misses something:

  1. You are going to have to hit the DB to get the initial list anyway
  2. You are going to have to hit the DB to insert new entries anyway
  3. Looking up values in a dictionary to replace them with IDs is exactly what a database does (assume a Non-Clustered Index on each of these Name-to-ID lookups)
  4. Frequently queried values will have their data pages cached in the Buffer Pool (which is a memory cache)

Why duplicate at the app layer what is already provided, and already happening, at the DB layer, especially given:

  • The 15-20 entities can have up to 20k records (which is a relatively small number, especially considering that the Non-Clustered Index only needs two fields, Name and ID, which can pack many rows into a single data page when using a 100% Fill Factor).
  • Not all 20k entries are "active" or "current", so you don't need to worry about caching all of them. Whatever values are current will easily be identified as the ones being queried, and those data pages (which may include some inactive entries, but no big deal there) will be the ones that get cached in the Buffer Pool.

Hence, you don't need to worry about aging out old entries OR forcing any key expirations or reloads due to possibly changing values (i.e. an updated Name for a particular ID), as that is handled naturally.

Yes, in-memory caching is wonderful technology and greatly speeds up websites, but those scenarios / use-cases are for when non-database processes are requesting the same data over and over for purely read-only purposes. This particular scenario, however, is one in which data is being merged and the list of lookup values can change frequently (more so due to new entries than due to updated entries).


That all being said, Option #2 is the way to go. I have done this technique several times with much success, though not with 15 TVPs. It might be that some optimizations / adjustments need to be made to the method to tune this particular situation, but what I have found to work well is:

  • Accept the data via TVP. I prefer this over SqlBulkCopy because:
    • it makes for an easily self-contained Stored Procedure
    • it fits very nicely into the app code to fully stream the collection(s) to the DB without needing to copy the collection(s) to a DataTable first, which duplicates the collection and wastes CPU and memory. This requires that you create a method per collection that returns IEnumerable<SqlDataRecord>, accepts the collection as input, and uses yield return; to send each record in the for or foreach loop (see the C# sketch after this list).
  • TVPs are not great for statistics and hence not great for JOINing to (though this can be mitigated by using a TOP (@RecordCount) in the queries), but you don't need to worry about that anyway since they are only used to populate the real tables with any missing values.
  • Step 1: Insert missing Names for each entity. Remember that there should be a NonClustered Index on the [Name] field of each entity, and, assuming that the ID is the Clustered Index, that value will naturally be a part of the index, so [Name] alone will provide a covering index in addition to helping the following operation. Also remember that any prior executions for this client (i.e. roughly the same entity values) will cause the data pages for these indexes to remain cached in the Buffer Pool (i.e. memory).

     ;WITH cte AS
     (
       SELECT DISTINCT tmp.[Name]
       FROM   @EntityNumeroUno tmp
     )
     INSERT INTO EntityNumeroUno ([Name])
       SELECT cte.[Name]
       FROM   cte
       WHERE  NOT EXISTS (SELECT *
                          FROM   EntityNumeroUno tab
                          WHERE  tab.[Name] = cte.[Name]);
  • Step 2: INSERT all of the "messages" via a simple INSERT...SELECT, where the data pages for the lookup tables (i.e. the "entities") are already cached in the Buffer Pool due to Step 1.
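
As a hedged illustration of the streaming pattern mentioned in the bullets above, here is a minimal C# sketch of a method that returns IEnumerable<SqlDataRecord> and uses yield return. The VendorPrice class, column names, and sizes are assumptions rather than the actual message structure; the point is that ADO.NET enumerates the method while the command executes, so rows stream into the TVP without first being copied into a DataTable.

using System.Collections.Generic;
using System.Data;
using Microsoft.SqlServer.Server;

// Hypothetical transactional item; the real message structure is not shown in the question.
public class VendorPrice
{
    public string VendorName { get; set; }
    public decimal ListPrice { get; set; }
    public decimal Discount { get; set; }
}

public static class TvpStreaming
{
    // One such method would exist per collection / TVP. ADO.NET enumerates it while the
    // command executes, so rows are streamed rather than materialized twice in memory.
    public static IEnumerable<SqlDataRecord> ToSqlDataRecords(IEnumerable<VendorPrice> prices)
    {
        var record = new SqlDataRecord(
            new SqlMetaData("VendorName", SqlDbType.NVarChar, 100),
            new SqlMetaData("ListPrice", SqlDbType.Decimal, 19, 4),
            new SqlMetaData("Discount", SqlDbType.Decimal, 19, 4));

        foreach (var price in prices)
        {
            record.SetString(0, price.VendorName);
            record.SetDecimal(1, price.ListPrice);
            record.SetDecimal(2, price.Discount);
            yield return record;   // the same SqlDataRecord instance is re-used for each row
        }
    }
}

The result of ToSqlDataRecords(...) would then be assigned to a SqlParameter with SqlDbType.Structured and the TVP's TypeName (a sketch of such a call appears after the ImportProduct procedure further down).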


Finally, keep in mind that conjecture / assumptions / educated guesses are no substitute for testing. You need to try a few methods to see what works best for your particular situation, since there might be additional details that have not been shared that could influence what is considered "ideal" here.

I will say that if the Messages are insert-only, then Vlad's idea might be faster. The method I am describing here I have used in situations that were more complex and required full syncing (updates and deletes) and did additional validations and creation of related operational data (not lookup values). Using SqlBulkCopy might be faster on straight inserts (though for only 2000 records I doubt there is much difference, if any at all), but this assumes you are loading directly into the destination tables (messages and lookups) and not into intermediary / staging tables (and I believe Vlad's idea is to SqlBulkCopy directly into the destination tables). However, as stated above, using an external cache (i.e. not the Buffer Pool) is also more error-prone due to the issue of updating lookup values. It could take more code than it's worth to account for invalidating an external cache, especially if using an external cache is only marginally faster. That additional risk / maintenance needs to be factored into which method is overall better for your needs.


UPDATE

Based on the info provided in comments, we now know:

  • There are multiple Vendors
  • There are multiple Products offered by each Vendor
  • Products are not unique to a Vendor; Products are sold by 1 or more Vendors
  • Product properties are singular
  • Pricing info has properties that can have multiple records
  • Pricing info is INSERT-only (i.e. point-in-time history)
  • A unique Product is determined by SKU (or similar field)
  • Once created, a Product coming through with an existing SKU but otherwise different properties (e.g. category, manufacturer, etc.) will be considered the same Product; the differences will be ignored

With all of this in mind, I will still recommend TVPs, but re-think the approach and make it Vendor-centric, not Product-centric. The assumption here is that Vendors send files whenever. So when you get a file, import it. The only lookup you would be doing ahead of time is the Vendor. Here is the basic layout (a rough C# sketch of this flow follows the list):

  1. It seems reasonable to assume that you already have a VendorID at this point, because why would the system be importing a file from an unknown source?
  2. You can import in batches
  3. Create a SendRows method that:
    • accepts a FileStream or something that allows for advancing through a file
    • accepts something like int BatchSize
    • returns IEnumerable<SqlDataRecord>
    • creates a SqlDataRecord to match the TVP structure
    • for loops through the FileStream until either BatchSize has been met or there are no more records in the File
    • performs any necessary validations on the data
    • maps the data to the SqlDataRecord
    • calls yield return;
  4. Open the file
  5. While there is data in the file:
    • call the stored proc
    • pass in VendorID
    • pass in SendRows(FileStream, BatchSize) for the TVP
  6. Close the file
  7. Experiment with:
    • opening the SqlConnection before the loop around the FileStream and closing it after the loops are done
    • opening the SqlConnection, executing the stored procedure, and closing the SqlConnection inside of the FileStream loop
  8. Experiment with various BatchSize values: start at 100, then 200, 500, etc.
  9. The stored proc will handle inserting new Products
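
A rough C# sketch of that flow, under stated assumptions: the file is read through a StreamReader (standing in for the FileStream mentioned above) with a simple "SKU,ListPrice" line layout, the stored procedure is named dbo.ImportVendorProducts, and the TVP type is dbo.VendorProductRows; none of these names come from the question.

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using Microsoft.SqlServer.Server;

public static class VendorFileImporter
{
    public static void ImportFile(string connectionString, string filePath, int vendorId, int batchSize)
    {
        using (var reader = new StreamReader(filePath))                        // 4. open the file
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();   // 7. experiment: open once here vs. per batch inside the loop

            while (!reader.EndOfStream)                                        // 5. while there is data
            {
                using (var command = new SqlCommand("dbo.ImportVendorProducts", connection))
                {
                    command.CommandType = CommandType.StoredProcedure;
                    command.Parameters.AddWithValue("@VendorID", vendorId);

                    var tvp = command.Parameters.AddWithValue("@ProductRows", SendRows(reader, batchSize));
                    tvp.SqlDbType = SqlDbType.Structured;
                    tvp.TypeName = "dbo.VendorProductRows";                    // assumed TVP type name

                    command.ExecuteNonQuery();                                 // 9. proc inserts new Products
                }
            }
        }                                                                      // 6. file closed by using
    }

    // 3. SendRows: yields up to batchSize rows per call, mapped to the TVP structure.
    private static IEnumerable<SqlDataRecord> SendRows(StreamReader reader, int batchSize)
    {
        var record = new SqlDataRecord(
            new SqlMetaData("SKU", SqlDbType.VarChar, 50),
            new SqlMetaData("ListPrice", SqlDbType.Decimal, 19, 4));

        for (int i = 0; i < batchSize && !reader.EndOfStream; i++)
        {
            string[] fields = reader.ReadLine().Split(',');   // assumes a simple "SKU,ListPrice" layout
            // perform any necessary validations on the fields here
            record.SetString(0, fields[0]);
            record.SetDecimal(1, decimal.Parse(fields[1]));
            yield return record;
        }
    }
}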

Using this type of structure you will be sending in Product properties that are not used (i.e. only the SKU is used for the lookup of existing Products). BUT, it scales very well as there is no upper bound on file size. If the Vendor sends 50 Products, fine. If they send 50k Products, fine. If they send 4 million Products (which is the system I worked on, and it did handle updating Product info that was different for any of its properties!), then fine. There is no increase in memory at the app layer or DB layer to handle even 10 million Products. The time the import takes should increase in step with the amount of Products sent.


UPDATE 2
New details related to the Source data:

  • comes from Azure EventHub
  • comes in the form of C# objects (no files)
  • Product details come in through the OP's system's APIs
  • is collected in a single queue (just pull the data out and insert it into the database)

If the data source is C# objects then I would most definitely use TVPs, as you can send them over as-is via the method I described in my first update (i.e. a method that returns IEnumerable<SqlDataRecord>). Send one or more TVPs for the Price/Offer-per-Vendor details, but regular input params for the singular Property attributes. For example:

CREATE PROCEDURE dbo.ImportProduct
(
  @SKU             VARCHAR(50),
  @ProductName     NVARCHAR(100),
  @Manufacturer    NVARCHAR(100),
  @Category        NVARCHAR(300),
  @VendorPrices    dbo.VendorPrices READONLY,
  @DiscountCoupons dbo.DiscountCoupons READONLY
)
AS
SET NOCOUNT ON;

-- Insert Product if it doesn't already exist
IF (NOT EXISTS(
         SELECT  *
         FROM    dbo.Products pr
         WHERE   pr.SKU = @SKU
              )
   )
BEGIN
  INSERT INTO dbo.Products (SKU, ProductName, Manufacturer, Category, ...)
  VALUES (@SKU, @ProductName, @Manufacturer, @Category, ...);
END;

-- ...INSERT data from the TVPs
-- might need OPTION (RECOMPILE) per each TVP query to ensure proper estimated rows
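
Calling a procedure shaped like the one above from C# might look roughly like this. This is a sketch only: it re-uses the hypothetical VendorPrice class and ToSqlDataRecords helper from the earlier streaming sketch, and the @DiscountCoupons TVP is omitted for brevity.

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public static class ProductImport
{
    public static void ImportProduct(
        SqlConnection connection,
        string sku, string productName, string manufacturer, string category,
        IEnumerable<VendorPrice> vendorPrices)   // VendorPrice / ToSqlDataRecords from the earlier sketch
    {
        using (var command = new SqlCommand("dbo.ImportProduct", connection))
        {
            command.CommandType = CommandType.StoredProcedure;

            // Singular master-data attributes go in as plain scalar parameters.
            command.Parameters.AddWithValue("@SKU", sku);
            command.Parameters.AddWithValue("@ProductName", productName);
            command.Parameters.AddWithValue("@Manufacturer", manufacturer);
            command.Parameters.AddWithValue("@Category", category);

            // Multi-row price details stream in through the TVP.
            // Note: if vendorPrices can be empty, pass null for the TVP value instead
            // (ADO.NET throws if the enumeration yields no rows).
            var prices = command.Parameters.AddWithValue(
                "@VendorPrices", TvpStreaming.ToSqlDataRecords(vendorPrices));
            prices.SqlDbType = SqlDbType.Structured;
            prices.TypeName = "dbo.VendorPrices";   // must match the TVP type used by the procedure

            command.ExecuteNonQuery();
        }
    }
}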

From a DB point of view, there is nothing as fast as BULK INSERT (from CSV files, for example). The best approach is to bulk-load all the data as soon as possible, and then process it with stored procedures.

A C# layer will just slow down the process, since all the queries between C# and SQL will be thousands of times slower than what SQL Server can handle directly.
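
A minimal sketch of that pattern from the C# side, assuming a hypothetical staging table dbo.Staging_VendorPrices and a hypothetical processing procedure dbo.ProcessStagedPrices: bulk-load the raw rows first, then let a stored procedure do the set-based work (resolving master data and inserting the transactional rows).

using System.Data;
using System.Data.SqlClient;

public static class StagingLoader
{
    public static void LoadAndProcess(string connectionString, DataTable rawRows)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // 1) Get the raw data into the database as fast as possible.
            using (var bulkCopy = new SqlBulkCopy(connection))
            {
                bulkCopy.DestinationTableName = "dbo.Staging_VendorPrices";   // hypothetical staging table
                bulkCopy.WriteToServer(rawRows);
            }

            // 2) Process it server-side with a stored procedure (set-based, no per-row round trips).
            using (var command = new SqlCommand("dbo.ProcessStagedPrices", connection))
            {
                command.CommandType = CommandType.StoredProcedure;
                command.ExecuteNonQuery();
            }
        }
    }
}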
