How do I write a query that efficiently handles a large number of records?

Question

Suppose I have a table X that has a billion records.

Table X

ProductID AccountID ContractID

ProductID and AccountID form the composite key for Table X.

Now, in memory, I have a map (let's say a Java HashMap) that contains a million (ProductID, AccountID) pairs.

I want to create a file that contains every (ProductID, AccountID) pair in the map together with the corresponding ContractID.

I could use a for loop and query the table once for each (ProductID, AccountID) pair, but then I would have to do this a million times, which would be really inefficient.

The question is: how do I write a query that does this efficiently? Can such a query be written at all? Is there another way out?

Answer 1

If speed and efficiency are important, then a query with a million "unions" or a million items in an IN clause is not going to be acceptable.

A more performant solution would be to bulk insert your ProductID/AccountID hashmap into a temp table, let's call it #temp. I'm not going to describe the bulk insert because that is database dependent. Then you can perform a simple join query:

SELECT X.ProductID, X.AccountID, X.ContractID
FROM X
INNER JOIN #temp t ON t.ProductID = X.ProductID AND t.AccountID = X.AccountID
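
The temp-table approach can be sketched end to end as follows. This is only an illustration under stated assumptions: SQLite stands in for the unspecified database, Python's built-in sqlite3 module stands in for the Java application, and the names wanted_pairs and fetch_contracts as well as the sample rows are made up. Since #temp is SQL Server syntax, the sketch uses CREATE TEMP TABLE instead, and executemany takes the place of a real bulk-load facility.

```python
import sqlite3

def fetch_contracts(conn, pairs):
    """Return (ProductID, AccountID, ContractID) rows of X matching the given pairs."""
    cur = conn.cursor()
    # Session-local scratch table; the name wanted_pairs is hypothetical.
    cur.execute(
        "CREATE TEMP TABLE wanted_pairs ("
        "ProductID INTEGER, AccountID INTEGER, "
        "PRIMARY KEY (ProductID, AccountID))"
    )
    # executemany runs the same INSERT once per pair (stand-in for a bulk load).
    cur.executemany("INSERT INTO wanted_pairs VALUES (?, ?)", pairs)
    # One set-based join replaces a million single-row lookups.
    cur.execute(
        "SELECT X.ProductID, X.AccountID, X.ContractID "
        "FROM X INNER JOIN wanted_pairs t "
        "ON t.ProductID = X.ProductID AND t.AccountID = X.AccountID "
        "ORDER BY X.ProductID, X.AccountID"
    )
    rows = cur.fetchall()
    cur.execute("DROP TABLE wanted_pairs")
    return rows

# Tiny stand-in for the billion-row table X (sample data is invented).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE X (ProductID INTEGER, AccountID INTEGER, ContractID INTEGER, "
    "PRIMARY KEY (ProductID, AccountID))"
)
conn.executemany(
    "INSERT INTO X VALUES (?, ?, ?)",
    [(1, 99, 1001), (1, 77, 1002), (3, 77, 1003), (3, 99, 1004)],
)

result = fetch_contracts(conn, [(1, 99), (3, 77)])
print(result)  # [(1, 99, 1001), (3, 77, 1003)]
```

Because the join compares both columns of each row together, only the exact pairs that were loaded come back. Pairs with no matching row in X are simply absent from the result.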

Answer 2

Without knowing the exact SQL dialect, I'd perform an INNER JOIN:

SELECT X.ProductID, X.AccountID, X.ContractID
FROM X
INNER JOIN MemTable m ON m.ProductID = X.ProductID AND m.AccountID = X.AccountID

You have now added Java as a tag, so am I right in thinking that the map is within your Java application? If so, it will get tough: you may actually need to query the database a million times.

On the other hand, you could construct a string containing one single, large SQL statement like this:

SELECT * FROM X WHERE ProductID IN (...) AND AccountID IN (...)

where your loop just fills in comma-separated lists of product IDs and account IDs. Then you issue that command once. Assuming both IDs are numeric, the command should look something like this:

SELECT * FROM X WHERE ProductID IN (1,2,3,4) AND AccountID IN (99,88,77)

EDIT
Please note that my last suggestion may have the following flaw (you'll have to decide whether this is actually a problem for you):

Assume your map contains (1, 99) and (3, 77), but in table X there are additional records (1, 77) and (3, 99). The result of my query will be (1, 99), (3, 77), (1, 77) and (3, 99), as the two IDs are not treated as an "entity" but individually.

So any row that contains any combination of the given ProductIDs and AccountIDs will be returned.
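
The flaw is easy to reproduce on the four rows from the example. The sketch below assumes SQLite through Python's built-in sqlite3 module as a stand-in for the real database, and the ContractID values are invented. The final filtering step is one possible workaround, not something the answer itself proposes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (ProductID INTEGER, AccountID INTEGER, ContractID INTEGER)")
conn.executemany(
    "INSERT INTO X VALUES (?, ?, ?)",
    [(1, 99, 1001), (3, 77, 1002), (1, 77, 1003), (3, 99, 1004)],
)

wanted = {(1, 99), (3, 77)}  # the pairs held in the in-memory map

# Two independent IN lists match every combination of the listed IDs.
rows = conn.execute(
    "SELECT ProductID, AccountID, ContractID FROM X "
    "WHERE ProductID IN (1, 3) AND AccountID IN (99, 77) "
    "ORDER BY ProductID, AccountID"
).fetchall()
print(len(rows))  # 4: the two wanted rows plus (1, 77) and (3, 99)

# Filtering against the map afterwards keeps only the pairs that were asked for.
exact = [r for r in rows if (r[0], r[1]) in wanted]
print(exact)  # [(1, 99, 1001), (3, 77, 1002)]
```

With a million pairs the number of unwanted combinations could be large, so filtering on the client only makes sense if the extra rows are few.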

Assuming the DB system you're using allows for this, you could expand the SELECT statement into something like this:

SELECT ProductID, AccountID, ContractID FROM X WHERE ProductID = <ValueFromMap> AND AccountID = <ValueFromMap>
UNION ALL
SELECT ProductID, AccountID, ContractID FROM X WHERE ...
UNION ALL
...
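
A statement of this shape could be generated from the map roughly as follows. This is a sketch under assumptions: SQLite via Python's built-in sqlite3 module stands in for the real database, fetch_with_union_all and chunk_size are hypothetical names, and the sample rows are invented. Two choices here are not from the answer: the values are bound as parameters instead of being pasted into the string, and the pairs are sent in chunks, because databases typically cap the statement size or the number of parameters.

```python
import sqlite3

# Statement for a single pair; the values are bound as parameters.
ONE_PAIR = (
    "SELECT ProductID, AccountID, ContractID FROM X "
    "WHERE ProductID = ? AND AccountID = ?"
)

def fetch_with_union_all(conn, pairs, chunk_size=100):
    """Look up pairs in chunks, issuing one UNION ALL statement per chunk."""
    pairs = list(pairs)
    rows = []
    for start in range(0, len(pairs), chunk_size):
        chunk = pairs[start:start + chunk_size]
        # One SELECT per pair, glued together with UNION ALL.
        sql = " UNION ALL ".join([ONE_PAIR] * len(chunk))
        # Flatten [(p, a), ...] into [p, a, p, a, ...] to match the placeholders.
        params = [value for pair in chunk for value in pair]
        rows.extend(conn.execute(sql, params).fetchall())
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (ProductID INTEGER, AccountID INTEGER, ContractID INTEGER)")
conn.executemany(
    "INSERT INTO X VALUES (?, ?, ?)",
    [(1, 99, 1001), (3, 77, 1002), (1, 77, 1003), (3, 99, 1004)],
)

# chunk_size=2 forces two statements for three pairs; (5, 5) is not in X.
found = fetch_with_union_all(conn, [(1, 99), (3, 77), (5, 5)], chunk_size=2)
print(sorted(found))  # [(1, 99, 1001), (3, 77, 1002)]
```

Unlike the two IN lists, each SELECT matches one exact pair, so the rows (1, 77) and (3, 99) are not returned.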

Answer 3

I guess your memory map is in your Java program? If so, I think there is no efficient solution that is database independent. The best I can think of is to try to find contiguous ID ranges in your memory map so that you can write SELECT ... FROM X WHERE id >= xx AND id <= yy and avoid selecting duplicate IDs.
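
Finding the contiguous ranges could look like the sketch below. It is written in Python for brevity, and the helper name contiguous_ranges is made up. It handles a single integer ID column, as the answer's example does. With the composite key from the question, the rows fetched for a range would still have to be matched against the second column.

```python
def contiguous_ranges(ids):
    """Collapse integer IDs into inclusive (low, high) runs of consecutive values."""
    ranges = []
    for value in sorted(set(ids)):  # set() drops duplicate IDs
        if ranges and value == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], value)  # extend the current run
        else:
            ranges.append((value, value))  # start a new run
    return ranges

ranges = contiguous_ranges([7, 1, 2, 3, 10, 11, 3])
print(ranges)  # [(1, 3), (7, 7), (10, 11)]
```

Each (low, high) pair then becomes one "id >= low AND id <= high" condition. How much this saves depends on how densely the IDs cluster: with scattered IDs there are almost as many ranges as IDs.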

3 answers

Answer 1: score 2, posted 2013-06-12 15:29:13
Answer 2: score 1, accepted, posted 2013-06-12 15:01:31
Answer 3: score 0, posted 2013-06-12 15:07:15
