Copy a subset of rows from one table to another, filtering on two columns

I have the following MySQL table containing my raw event data (about 1.5 million rows):

userId  | pathId  | other stuff....

I have an index on (userId, pathId), with approximately 50,000 unique combinations.

During processing, I identify about 30,000 (userId, pathId) pairs that I don't want, but I do want to keep the original raw table intact. So I want to copy all rows into a processed event table, except the rows that match one of these 30,000 (userId, pathId) pairs.

An approach I'm considering is to write the 30,000 (userId, pathId) pairs of the rows I do not want into a temp_table, and then do something like this:

[create table processed_table ...]
insert into processed_table 
   select * from raw_table r 
   where not exists (
       select * from temp_table t where r.userId=t.userid and r.pathId=t.pathId
   )
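
For readers who want to try the approach end to end, here is a minimal sketch on a toy dataset. It uses Python's built-in sqlite3 module rather than MySQL, so it only illustrates the logic and says nothing about MySQL performance. The column types, the `payload` column, the index name, and the composite primary key on temp_table are all assumptions made for illustration; they are not part of the original schema.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_table (userId INTEGER, pathId INTEGER, payload TEXT);
    CREATE INDEX idx_raw_user_path ON raw_table (userId, pathId);

    -- Exclusion list: one row per unwanted (userId, pathId) pair.
    -- The composite primary key is an assumption; it keeps pairs unique
    -- and gives the lookup in the subquery an index to use.
    CREATE TEMP TABLE temp_table (
        userId INTEGER NOT NULL,
        pathId INTEGER NOT NULL,
        PRIMARY KEY (userId, pathId)
    );

    CREATE TABLE processed_table (userId INTEGER, pathId INTEGER, payload TEXT);
""")

# Five raw events; two (userId, pathId) pairs are marked as unwanted.
conn.executemany("INSERT INTO raw_table VALUES (?, ?, ?)", [
    (1, 10, "a"), (1, 10, "b"), (1, 11, "c"), (2, 10, "d"), (2, 12, "e"),
])
conn.executemany("INSERT INTO temp_table VALUES (?, ?)", [(1, 10), (2, 12)])

# Copy every raw row whose pair does not appear in the exclusion list.
conn.execute("""
    INSERT INTO processed_table
    SELECT * FROM raw_table r
    WHERE NOT EXISTS (
        SELECT 1 FROM temp_table t
        WHERE r.userId = t.userId AND r.pathId = t.pathId
    )
""")

processed = conn.execute(
    "SELECT userId, pathId, payload FROM processed_table ORDER BY payload"
).fetchall()
print(processed)
```

The raw table is left untouched; only the rows whose pair is absent from temp_table are copied.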

For info, processed_table generally ends up being about half the size of raw_table.

Anyway, this seems to work, but my SQL skills are limited, so my question (finally) is: is this the most efficient way to do this?

No, it's not the most efficient. Quoting the source:

That's why the best way to search for missing values in MySQL is using a LEFT JOIN / IS NULL or NOT IN rather than NOT EXISTS.

Here's an example with NOT IN:

INSERT INTO processed_table 
SELECT *
FROM raw_table 
WHERE (userId, pathId) NOT IN (
    SELECT userId, pathId FROM temp_table
)

And with LEFT JOIN ... IS NULL:

INSERT INTO processed_table 
SELECT r.*
FROM raw_table r
LEFT JOIN temp_table t
ON r.userId = t.userId AND r.pathId = t.pathId
WHERE t.userId IS NULL
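
As a sanity check that the three forms (NOT EXISTS, NOT IN, and LEFT JOIN ... IS NULL) select the same rows, here is a small sketch using Python's built-in sqlite3 module on toy data. SQLite is not MySQL, so this only compares results, not speed; the `payload` column and the sample values are assumptions for illustration. The row-value NOT IN form needs SQLite 3.15 or later. The join form selects `r.*` because a plain `SELECT *` over a join also returns the columns of temp_table, which would not match the column count of processed_table in an INSERT ... SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_table (userId INTEGER NOT NULL, pathId INTEGER NOT NULL, payload TEXT);
    CREATE TABLE temp_table (userId INTEGER NOT NULL, pathId INTEGER NOT NULL);
    INSERT INTO raw_table VALUES (1, 10, 'a'), (1, 11, 'b'), (2, 10, 'c'), (2, 12, 'd');
    INSERT INTO temp_table VALUES (1, 10), (2, 12);
""")

# The three anti-join forms discussed in this thread, as plain SELECTs.
queries = {
    "not_exists": """
        SELECT r.* FROM raw_table r
        WHERE NOT EXISTS (
            SELECT 1 FROM temp_table t
            WHERE r.userId = t.userId AND r.pathId = t.pathId)
    """,
    "not_in": """
        SELECT * FROM raw_table
        WHERE (userId, pathId) NOT IN (SELECT userId, pathId FROM temp_table)
    """,
    "left_join": """
        SELECT r.* FROM raw_table r
        LEFT JOIN temp_table t
          ON r.userId = t.userId AND r.pathId = t.pathId
        WHERE t.userId IS NULL
    """,
}

# Run each query and sort the rows so the results can be compared directly.
results = {name: sorted(conn.execute(sql).fetchall()) for name, sql in queries.items()}

# A plain SELECT * over the join returns columns from both tables.
join_star = conn.execute(
    "SELECT * FROM raw_table r LEFT JOIN temp_table t "
    "ON r.userId = t.userId AND r.pathId = t.pathId"
)
join_star_width = len(join_star.description)

for name, rows in results.items():
    print(name, rows)
print("columns returned by SELECT * over the join:", join_star_width)
```

Both key columns are declared NOT NULL here on purpose: if either side could contain NULLs, NOT IN and NOT EXISTS would no longer be guaranteed to agree.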

However, since your tables are fairly small (about 1.5 million raw rows and only around 50,000 unique userId, pathId combinations), your original query is probably fast enough.
