Best way to fetch data from a single database table with multiple threads?

Question

We have a system that collects data every second on user activity across multiple web sites. We dump that data into database X (say, MS SQL Server). We now need to fetch the data from this single table in database X and insert it into database Y (say, MySQL).

We want to fetch time-based data from database X through multiple threads so that we fetch it as fast as we can. Once the data has been fetched and stored in database Y, we will delete it from database X.

Are there any best practices for this sort of design? Are there specific things to take care of in the table design, like sharding or something? Is there anything else we need to take care of to make sure we fetch the data as fast as we can from threads running on multiple machines?

Thanks in advance! Ravi

Answer 1

If you are moving data from one database to another, you will not gain any advantage by having multiple threads do the work. It will only increase contention.

If both databases are of the same type, you should look into the vendor's own replication tools. These will basically always outperform homegrown solutions.

If the databases are different (vendors), you have to decide upon an efficient mechanism for:

identifying new/updated/deleted rows (triggers, range-based queries, full dumps)
transporting the data (unload to file and FTP, pull/push from a program)
loading the data into the other database (import, bulk insert)
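As a rough illustration of the range-based option combined with a pull/push program and a bulk insert, here is a minimal sketch. It is not tied to SQL Server or MySQL: two in-memory `sqlite3` databases stand in for X and Y, and the `activity` table, its columns, and the `copy_new_rows` helper are all hypothetical names chosen for the example. The idea is to remember the highest id already copied (a high-water mark) and only ever ask the source for rows above it.

```python
import sqlite3

def make_db():
    # Stand-in for a real database; table and column names are assumptions.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE activity (id INTEGER PRIMARY KEY, site TEXT, ts INTEGER)")
    return db

def copy_new_rows(src, dst, last_id, batch_size=1000):
    """Copy rows with id > last_id from src to dst; return the new high-water mark."""
    rows = src.execute(
        "SELECT id, site, ts FROM activity WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, batch_size),
    ).fetchall()
    if not rows:
        return last_id
    with dst:  # one transaction for the whole batch
        dst.executemany("INSERT INTO activity (id, site, ts) VALUES (?, ?, ?)", rows)
    return rows[-1][0]

src, dst = make_db(), make_db()
with src:
    src.executemany(
        "INSERT INTO activity VALUES (?, ?, ?)",
        [(i, "example.com", 1000 + i) for i in range(1, 6)],
    )

# Two passes with a batch size of 3 pick up all five rows.
mark = copy_new_rows(src, dst, last_id=0, batch_size=3)
mark = copy_new_rows(src, dst, last_id=mark, batch_size=3)
copied = dst.execute("SELECT COUNT(*) FROM activity").fetchone()[0]
```

A monotonically increasing key is assumed here; with a timestamp column instead, rows that arrive late with an older timestamp would need separate handling.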

Without more details, it's impossible to be more specific than that. Oh, and the two most important considerations that will influence your choice are:

What is the expected data volume?
What is the longest acceptable delay between row creation in the source DB and availability in the target DB?

Answer 2

I would test (by measurement) your assumption that multiple slurper threads will speed things up. Without more specifics in your question, it looks like you want to run an ETL (extract, transform, load) process against your database. These are pretty efficient when you let the database-specific technology handle them, especially if you're interested in aggregation etc.

Answer 3

There are two levels of concern in your issue:

The transaction between the two databases:
This is important because you will delete data from the source database. You must ensure that data is removed from X only after it has been stored in Y successfully. On the other hand, you must ensure that the deletion from X succeeds, to prevent the same data from being re-inserted into Y.
The performance of transferring data:
If X is an online database that receives incoming data at any time, it is not good practice to simply collect the data, store it in Y, and delete it. Instead, plan a batch size: the program starts a transaction for each batch and runs repeatedly until the number of rows left in X is smaller than the batch size.

In both databases, you should add a table that records the batches being processed. A batch goes through three states.

INIT - The start of a batch. This value should be synchronized between the two databases.
COPIED - In database Y, the insertion of data and the update to this status should be in one transaction.
FINISH - In database X, the deletion of data and the update to this status should be in one transaction.

When the program runs, it first checks for batches in the 'INIT' or 'COPIED' state and resumes processing them.

If X has an "INIT" record and Y doesn't, insert the same INIT record into Y, then perform the data insertion into Y.
If a record in Y is "COPIED" and the corresponding record in X is "INIT", update the state in X to "COPIED", then perform the deletion from X.
If a record in X is "FINISH" and the corresponding record in Y is "COPIED", update the state in Y to "FINISH".
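The recovery rules above can be sketched as a small state machine. This is only an illustration under stated assumptions: two in-memory `sqlite3` databases stand in for X and Y, the `activity` and `batch` tables and the helpers `start_batch`, `state_of`, and `advance` are hypothetical names, and a batch is identified by an id range. One simplification: X goes straight from INIT to FINISH inside the deletion transaction rather than passing through COPIED first.

```python
import sqlite3

def make_db():
    # Stand-in for X or Y; schema is an assumption made for this sketch.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE activity (id INTEGER PRIMARY KEY, payload TEXT)")
    db.execute(
        "CREATE TABLE batch (batch_id INTEGER PRIMARY KEY, lo INTEGER, hi INTEGER, state TEXT)"
    )
    return db

def state_of(db, batch_id):
    row = db.execute("SELECT state FROM batch WHERE batch_id = ?", (batch_id,)).fetchone()
    return row[0] if row else None

def start_batch(x, batch_id, lo, hi):
    with x:
        x.execute("INSERT INTO batch VALUES (?, ?, ?, 'INIT')", (batch_id, lo, hi))

def advance(x, y, batch_id):
    """Drive one batch to FINISH from whatever state an earlier run left it in."""
    lo, hi = x.execute(
        "SELECT lo, hi FROM batch WHERE batch_id = ?", (batch_id,)
    ).fetchone()
    if state_of(y, batch_id) is None:  # X has INIT, Y has nothing yet
        with y:
            y.execute("INSERT INTO batch VALUES (?, ?, ?, 'INIT')", (batch_id, lo, hi))
    if state_of(y, batch_id) == "INIT":
        rows = x.execute(
            "SELECT id, payload FROM activity WHERE id BETWEEN ? AND ?", (lo, hi)
        ).fetchall()
        with y:  # data insert and COPIED status in one transaction on Y
            y.executemany("INSERT INTO activity VALUES (?, ?)", rows)
            y.execute("UPDATE batch SET state = 'COPIED' WHERE batch_id = ?", (batch_id,))
    if state_of(x, batch_id) != "FINISH":
        with x:  # deletion and FINISH status in one transaction on X
            x.execute("DELETE FROM activity WHERE id BETWEEN ? AND ?", (lo, hi))
            x.execute("UPDATE batch SET state = 'FINISH' WHERE batch_id = ?", (batch_id,))
    if state_of(y, batch_id) == "COPIED":
        with y:
            y.execute("UPDATE batch SET state = 'FINISH' WHERE batch_id = ?", (batch_id,))

x, y = make_db(), make_db()
with x:
    x.executemany(
        "INSERT INTO activity VALUES (?, ?)", [(i, f"event-{i}") for i in range(1, 4)]
    )
start_batch(x, 1, 1, 3)
advance(x, y, 1)
advance(x, y, 1)  # running again after completion changes nothing
left_in_x = x.execute("SELECT COUNT(*) FROM activity").fetchone()[0]
now_in_y = y.execute("SELECT COUNT(*) FROM activity").fetchone()[0]
```

Because each step checks the recorded state before acting, calling `advance` again after a crash picks up at the first step that did not commit, and rows are neither copied twice nor deleted before they are stored in Y.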

In conclusion, processing data in batches gives you a chance to optimize the transfer between the two databases. The batch size dominates the efficiency of the transfer, and the right value depends on two factors: how heavily those databases are used concurrently by other operations, and the tuning parameters of your databases. In general, the write throughput of Y is likely to be the bottleneck.

Answer 4

Threads are not the way to go. The database is the bottleneck here. Multiple threads will only increase contention. Even if 10 processes are jamming data into SQL Server, a single thread (rather than many) can pull it out faster. There is absolutely no doubt about that.

The SELECT itself can cause locks on the main table, reducing the throughput of the INSERTs, so I would "get in and get out" as fast as possible. If it were me, I would:

SELECT the rows based on a range query (date, recno, whatever), dump them into a file, and close the result set (cursor).
DELETE the rows based on the same range query.
Then process the dump. If possible, the dump format should be amenable to bulk-loading into MySQL.
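The three steps above can be sketched as follows. This is a minimal illustration, not SQL Server code: an in-memory `sqlite3` database stands in for the source, an `io.StringIO` buffer stands in for the dump file, and the `activity` table and `dump_and_delete` helper are hypothetical. CSV is used because it is a format that bulk loaders generally accept.

```python
import csv
import io
import sqlite3

def dump_and_delete(db, lo_ts, hi_ts, out):
    """Write rows with lo_ts <= ts < hi_ts to `out` as CSV, then delete that range."""
    cur = db.execute(
        "SELECT id, site, ts FROM activity WHERE ts >= ? AND ts < ? ORDER BY id",
        (lo_ts, hi_ts),
    )
    writer = csv.writer(out)
    count = 0
    for row in cur:
        writer.writerow(row)
        count += 1
    cur.close()  # release the result set before deleting
    with db:  # delete using exactly the same range predicate
        db.execute("DELETE FROM activity WHERE ts >= ? AND ts < ?", (lo_ts, hi_ts))
    return count

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE activity (id INTEGER PRIMARY KEY, site TEXT, ts INTEGER)")
with db:
    db.executemany(
        "INSERT INTO activity VALUES (?, ?, ?)",
        [(1, "a.example", 100), (2, "b.example", 150), (3, "a.example", 200)],
    )

buf = io.StringIO()  # stand-in for a real dump file
dumped = dump_and_delete(db, 100, 200, buf)
remaining = db.execute("SELECT id FROM activity").fetchall()
```

The range should lie entirely in the past. If new rows can still arrive inside the range between the SELECT and the DELETE, they would be deleted without ever being dumped.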

I don't want to beat up your architecture, but overall the design sounds problematic. SELECTing and DELETEing rows from a table undergoing a high INSERTion rate is going to create huge locking issues. I would look at "double-buffering" the data in SQL Server.

For example, every minute the inserts switch between two tables. In the first minute INSERTs go into TABLE_1, but when the minute rolls over they start INSERTing into TABLE_2, the next minute back to TABLE_1, and so forth. While INSERTs are going into TABLE_2, SELECT everything from TABLE_1 and dump it into MySQL (as efficiently as possible), then TRUNCATE the table (deleting all rows with zero penalty). This way, there is never lock contention between the readers and writers.

Coordinating the rollover point between TABLE_1 and TABLE_2 is the tricky part. But it can be done automatically through clever use of SQL Server partitioned views.
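To make the double-buffering idea concrete, here is a small sketch that picks the table from the parity of the current minute. It does not use partitioned views, and it runs against an in-memory `sqlite3` database rather than SQL Server. The helpers `active_table`, `idle_table`, `insert_event`, and `drain_idle` are hypothetical names, and a plain `DELETE` replaces `TRUNCATE`, which SQLite does not have.

```python
import sqlite3

TABLES = ("TABLE_1", "TABLE_2")

def active_table(epoch_seconds):
    """Table that receives INSERTs during the minute containing epoch_seconds."""
    return TABLES[(epoch_seconds // 60) % 2]

def idle_table(epoch_seconds):
    """The other table, which no writer should be touching in this minute."""
    return TABLES[(epoch_seconds // 60 + 1) % 2]

def insert_event(db, epoch_seconds, payload):
    # Table names come from the fixed TABLES tuple, never from user input.
    with db:
        db.execute(
            f"INSERT INTO {active_table(epoch_seconds)} (ts, payload) VALUES (?, ?)",
            (epoch_seconds, payload),
        )

def drain_idle(db, epoch_seconds):
    """Read everything from the idle table, then empty it."""
    table = idle_table(epoch_seconds)
    rows = db.execute(f"SELECT ts, payload FROM {table} ORDER BY ts").fetchall()
    with db:
        db.execute(f"DELETE FROM {table}")  # TRUNCATE TABLE on SQL Server
    return rows

db = sqlite3.connect(":memory:")
for name in TABLES:
    db.execute(f"CREATE TABLE {name} (ts INTEGER, payload TEXT)")

insert_event(db, 0, "a")   # minute 0 goes to TABLE_1
insert_event(db, 30, "b")  # still minute 0, TABLE_1
insert_event(db, 61, "c")  # minute 1 goes to TABLE_2
drained = drain_idle(db, 65)  # during minute 1, TABLE_1 is idle
still_buffered = db.execute("SELECT COUNT(*) FROM TABLE_2").fetchone()[0]
```

The sketch assumes every writer agrees on the clock and that no insert is still in flight when the minute rolls over. In practice the drain would probably wait a short grace period after the switch before reading the idle table.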

Answer 1
1 2010-12-23 14:37:21

Answer 2
0 2010-12-23 13:34:29

Answer 3
0 2012-03-09 04:03:27

Answer 4
0 2012-03-09 04:23:36