
What is the best approach to synchronizing large database tables across EF Core contexts?

My Scenario

I have three warehouse databases (Firebird) numbered 1, 2 and 3, each sharing the same schema and the same DbContext class. The following is the model of the Products table:

public class Product
{
    public string Sku { get; }
    public string Barcode { get; }
    public int Quantity { get; }
}

I also have a local "Warehouse Cache" database (MySQL) where I want to periodically download the contents of all three warehouses for caching reasons. The data model of a cached product is similar, with the addition of a number denoting the source warehouse index. This table should contain all product information from all three warehouses. If a product appears in both warehouses 1 and 3 (same Sku), then I want to have two entries in the local Cache table, each with the corresponding warehouse ID:

public class CachedProduct
{
    public int WarehouseId { get; set; } // Can be either 1, 2 or 3
    public string Sku { get; }
    public string Barcode { get; }
    public int Quantity { get; }
}

There are multiple possible solutions to this problem, but given the size of my datasets (~20k entries per warehouse), none of them seem viable or efficient, and I'm hoping that someone could give me a better solution.

The problem

If the local cache database is empty, then it's easy. Just download all products from all three warehouses, and dump them into the cache DB. However, on subsequent synchronizations, the cache DB will no longer be empty. In this case, I don't want to add all 60k products again, because that would be a tremendous waste of storage space. Instead, I would like to "upsert" the incoming data into the cache, so new products would be inserted normally, but if a product already exists in the cache (matching Sku and WarehouseId), then I just want to update the corresponding record (e.g. the Quantity could have changed in one of the warehouses since the last sync). This way, the number of records in the cache DB will always be exactly the sum of the three warehouses; never more and never less.

Things I've tried so far

The greedy method: This one is probably the simplest. For each product in each warehouse, check if a matching record exists in the cache table. If it does, then update; otherwise insert. The obvious problem is that there is no way to batch/optimize this, and it would result in tens of thousands of select, insert and update calls being executed on each synchronization.
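For illustration, the per-product loop I have in mind looks roughly like this (a sketch only; the warehouses collection, the cache context's DbSet name and the setters on CachedProduct are assumptions, not code I actually have):

// Naive per-product upsert: one SELECT plus one INSERT or UPDATE per product.
// "warehouses" is assumed to be a list of (int Id, DbContext Db) pairs for the three Firebird DBs.
foreach (var (warehouseId, warehouseDb) in warehouses)
{
    foreach (var product in warehouseDb.Set<Product>().AsNoTracking().ToList())
    {
        var cached = cacheDb.CachedProducts
            .SingleOrDefault(c => c.WarehouseId == warehouseId && c.Sku == product.Sku);

        if (cached == null)
        {
            // Not cached yet: insert a new row for this warehouse/SKU pair.
            cacheDb.CachedProducts.Add(new CachedProduct
            {
                WarehouseId = warehouseId,
                Sku = product.Sku,
                Barcode = product.Barcode,
                Quantity = product.Quantity
            });
        }
        else
        {
            // Already cached: only the mutable fields need to be refreshed.
            cached.Barcode = product.Barcode;
            cached.Quantity = product.Quantity;
        }
    }
}
cacheDb.SaveChanges();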

Clearing the Cache: Clear the local cache DB before every synchronization, and re-download all the data. My problem with this one is that it leaves a small window of time when no cache data will be available, which might cause problems with other parts of the application.

Using an EF-Core "Upsert" library: This one seemed the most promising, using the FlexLabs.Upsert library, since it seemed to support batched operations. Unfortunately the library seems to be broken, as I couldn't even get their own minimal example to work properly. A new row is inserted on every "upsert", regardless of the matching rule.
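For reference, the call I was attempting followed the pattern from the library's documentation, roughly like this (treat it as my intent, not as confirmed library behaviour, since the matching is exactly what I couldn't get to work):

// Intended batched upsert with FlexLabs.Upsert:
// match on (WarehouseId, Sku); update the mutable columns on a match, insert otherwise.
await cacheDb.CachedProducts
    .UpsertRange(incomingProducts)              // incomingProducts: the rows pulled from one warehouse
    .On(p => new { p.WarehouseId, p.Sku })      // the "natural key" to match existing cache rows on
    .WhenMatched((existing, incoming) => new CachedProduct
    {
        Barcode = incoming.Barcode,
        Quantity = incoming.Quantity
    })
    .RunAsync();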

Avoiding EF Core completely: I have found a library called Dotmim.Sync that seems to be a DB-to-DB synchronization library. The main issue with this is that the warehouses are running FirebirdDB, which doesn't seem to be supported by this library. Also, I'm not sure if I could even do data transformation, since I have to add the WarehouseId column before a row is added to the cache DB.


Is there a way to do this as efficiently as possible in EF Core?

There are a couple of options here. Which ones are viable depends on your staleness constraints for the cache. Must the cache always 100% reflect the warehouse state, or can it get out of sync for a period of time?

First, you absolutely should not use EF Core for this, except possibly as a client lib to do raw SQL. EF Core is optimized for many small transactions. It doesn't do great with batch workloads.
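To illustrate the raw-SQL route: MySQL can upsert a whole batch in one INSERT ... ON DUPLICATE KEY UPDATE statement, assuming the cache table has a unique key on (WarehouseId, Sku). A sketch, with EF Core used only to execute the SQL (table and column names are illustrative, and in practice you would chunk the batch to stay under MySQL's packet/parameter limits):

// Build a multi-row VALUES list with positional parameters for one chunk of products.
// MySqlParameter comes from whichever MySQL ADO.NET provider the context uses.
var values = string.Join(", ", chunk.Select((_, i) => $"(@w{i}, @s{i}, @b{i}, @q{i})"));
var parameters = chunk.SelectMany((p, i) => new object[]
{
    new MySqlParameter($"@w{i}", p.WarehouseId),
    new MySqlParameter($"@s{i}", p.Sku),
    new MySqlParameter($"@b{i}", p.Barcode),
    new MySqlParameter($"@q{i}", p.Quantity)
}).ToArray();

cacheDb.Database.ExecuteSqlRaw(
    "INSERT INTO cached_products (warehouse_id, sku, barcode, quantity) " +
    $"VALUES {values} " +
    "ON DUPLICATE KEY UPDATE barcode = VALUES(barcode), quantity = VALUES(quantity);",
    parameters);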

The 'best' option is probably an event-based system. Firebird supports emitting events to an event listener, which would then update the cache based on the events. The risk here is that if event processing fails, you could get out of sync. You could mitigate that risk by using an event bus of some sort (Rabbit, Kafka), but Firebird event handling itself would be the weak link.

If the cache can handle some inconsistency, you could attach an expiry timestamp to each cache entry. Your application hits the cache, and if the expiry date is past, it rechecks the warehouse DBs. Depending on the business processes that update the source-of-truth databases, you may also be able to bust cache entries (e.g. if there's an order management system, it can bust the cache for a line item when someone makes an order).
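A minimal sketch of that read-through idea, assuming an extra ExpiresAt column on the cache entity and a dictionary of the three warehouse contexts (all of those names are made up for illustration):

// Serve from the cache while the entry is fresh; otherwise re-read the warehouse and refresh it.
CachedProduct GetProduct(int warehouseId, string sku)
{
    var cached = cacheDb.CachedProducts
        .SingleOrDefault(c => c.WarehouseId == warehouseId && c.Sku == sku);

    if (cached != null && cached.ExpiresAt > DateTime.UtcNow)
        return cached;                                  // still fresh

    var fresh = warehouseDbs[warehouseId].Set<Product>()
        .AsNoTracking()
        .Single(p => p.Sku == sku);

    if (cached == null)
    {
        cached = new CachedProduct { WarehouseId = warehouseId, Sku = sku };
        cacheDb.CachedProducts.Add(cached);
    }
    cached.Barcode = fresh.Barcode;
    cached.Quantity = fresh.Quantity;
    cached.ExpiresAt = DateTime.UtcNow.AddMinutes(15);  // the TTL is an arbitrary example
    cacheDb.SaveChanges();
    return cached;
}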

If you have to batch sync, do a swap table. Set up a table with the live cache data, a separate table you load the new cache data into, and a flag in your application that says which you read from. You read from table A while you load into B, then when the load is done, you swap to read from table B.
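Sketched out, the swap could look something like this (the single-row settings table, the two physical table names and the bulk-load step are all assumptions):

// Readers always query whichever table the flag marks as live; the sync job loads the other one.
var settings = cacheDb.CacheSettings.Single();           // single-row table holding the flag
var liveTable = settings.LiveTable;                      // "cached_products_a" or "cached_products_b"
var stagingTable = liveTable == "cached_products_a" ? "cached_products_b" : "cached_products_a";

// Read path: query the live table (the table name is a trusted constant, not user input).
var products = cacheDb.CachedProducts
    .FromSqlRaw($"SELECT * FROM {liveTable}")
    .AsNoTracking()
    .ToList();

// Sync path: truncate and bulk-load stagingTable, then flip the flag in one tiny transaction
// so readers switch over atomically and never see a half-loaded cache.
settings.LiveTable = stagingTable;
cacheDb.SaveChanges();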

For now I ended up going with a simple, yet effective solution that is fully within EF Core.

For each cache entry, I also maintain a SyncIndex column. During synchronization, I download all products from all three warehouses, set their SyncIndex to max(cache.SyncIndex) + 1, and dump them into the cache database. Then I delete all entries from the cache with an older SyncIndex. This way I always have some cache data available, I don't waste a lot of space, and the speed is pretty acceptable too.
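For completeness, one synchronization pass boils down to something like this (a sketch; the warehouses collection, the context names and the setters on CachedProduct are placeholders for my actual code):

// Old cache rows stay queryable until the final delete, so there is never an empty window.
var nextIndex = (cacheDb.CachedProducts.Max(p => (int?)p.SyncIndex) ?? 0) + 1;

foreach (var (warehouseId, warehouseDb) in warehouses)
{
    var snapshot = warehouseDb.Set<Product>().AsNoTracking().ToList();

    cacheDb.CachedProducts.AddRange(snapshot.Select(p => new CachedProduct
    {
        WarehouseId = warehouseId,
        Sku = p.Sku,
        Barcode = p.Barcode,
        Quantity = p.Quantity,
        SyncIndex = nextIndex
    }));
}
cacheDb.SaveChanges();

// Everything from earlier passes is now obsolete and can be removed in one sweep
// (on EF Core 7+ this could be a single ExecuteDelete call instead of RemoveRange).
cacheDb.CachedProducts.RemoveRange(
    cacheDb.CachedProducts.Where(p => p.SyncIndex < nextIndex));
cacheDb.SaveChanges();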
