Faster way to use sets in MySQL

I have a MySQL 5.1 InnoDB table (customers) with the following structure:

int         record_id (PRIMARY KEY)
int         user_id (ALLOW NULL)
varchar[11] postcode (ALLOW NULL)
varchar[30] region (ALLOW NULL)
..
..
..

There are roughly 7 million rows in the table. Currently, the table is being queried like this:

SELECT * FROM customers WHERE user_id IN (32343, 45676, 12345, 98765, 66010, ...

In the actual query, there are currently over 560 user_ids in the IN clause. With several million records in the table, this query is slow!

There are secondary indexes on the table, the first of which is on user_id itself, which I thought would help.

I know that SELECT * is A Bad Thing, and it will be expanded to the full list of fields required. However, the fields not listed above are more ints and doubles. Another 50 of those are being returned, but they are needed for the report.

I imagine there's a much better way to access the data for these user_ids, but I can't think how to do it. My initial reaction is to remove the ALLOW NULL on the user_id field, as I understand that NULL handling slows down queries?

I'd be very grateful if you could point me in a more efficient direction than using the IN ( ) method.

EDIT: I ran EXPLAIN, which said:

select_type = SIMPLE 
table = customers 
type = range 
possible_keys = userid_idx 
key = userid_idx 
key_len = 5 
ref = (NULL) 
rows = 637640 
Extra = Using where 

Does that help?

First, check if there is an index on USER_ID and make sure it's used.

You can do this by running EXPLAIN.
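As a minimal sketch of that check, EXPLAIN can be put in front of a shortened version of the query from the question. The three ids are just the first few from the question's list, and the comments describe what to look for rather than guaranteed output:

```sql
-- Prefix the query with EXPLAIN to see the plan instead of running it.
EXPLAIN
SELECT *
FROM   customers
WHERE  user_id IN (32343, 45676, 12345);

-- In the result, "key" names the index the optimizer chose (NULL means none),
-- "type" shows the access method (range for an IN list on an indexed column),
-- and "rows" is the optimizer's estimate of how many rows it will examine.
```

The `rows` value is an estimate, so it is better read as an order of magnitude than as an exact count.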

Second, create a temporary table and use it in a JOIN:

-- Holds the user_ids to look up; fill it with INSERT before running the join.
CREATE TEMPORARY TABLE temptable (user_id INT NOT NULL);

SELECT  *
FROM    temptable t
JOIN    customers c
ON      c.user_id = t.user_id;
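The answer does not show how the ids get into the temporary table. A possible end-to-end flow is sketched below. The table name `wanted_user_ids` is hypothetical, the five ids are the ones visible in the question, and the primary key is an addition that keeps the id list free of duplicates, which would otherwise multiply the joined rows:

```sql
-- Session-local table; it disappears when the connection closes.
CREATE TEMPORARY TABLE wanted_user_ids (
    user_id INT NOT NULL PRIMARY KEY
);

-- One multi-row INSERT is cheaper than 560 single-row statements.
INSERT INTO wanted_user_ids (user_id)
VALUES (32343), (45676), (12345), (98765), (66010);

-- Same join as in the answer, driven by the id list.
SELECT c.*
FROM   wanted_user_ids w
JOIN   customers c ON c.user_id = w.user_id;

-- Optional cleanup if the connection is kept open and reused.
DROP TEMPORARY TABLE wanted_user_ids;
```

Whether this beats the plain IN list depends on the data, so it is worth timing both forms against a realistic copy of the table.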

Third, how many rows does your query return?

If it returns almost all rows, then it will just be slow, since to begin with it has to pump all these millions of rows over the connection channel.

NULL will not slow your query down, since the IN condition is only satisfied by non-NULL values, which are indexed.
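The three-valued logic behind that statement can be seen with a query that needs no table. This is only an illustration of how IN treats NULL, and the column aliases are made up for the example:

```sql
SELECT 1 IN (1, NULL)    AS is_match,    -- 1: a matching value is found
       2 IN (1, NULL)    AS is_miss,     -- NULL: unknown, so WHERE rejects the row
       NULL IN (1, NULL) AS null_input;  -- NULL: a NULL operand never matches
```

A row whose user_id is NULL therefore never satisfies `user_id IN (...)`; such rows can only be selected with `user_id IS NULL`.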

Update:

The index is used and the plan is fine, except that it returns more than half a million rows.

Do you really need to put all these 638,000 rows into the report?

Hope it's not printed: bad for rainforests, global warming and stuff.

Speaking seriously, you seem to need either aggregation or pagination on your query.
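Neither option is spelled out in the answer, so the following is a rough sketch of each. It uses only columns named in the question, and both the grouping column and the page size of 1000 are arbitrary choices for illustration:

```sql
-- Aggregation: one summary row per region instead of every matching customer row.
SELECT   region, COUNT(*) AS customer_rows
FROM     customers
WHERE    user_id IN (32343, 45676, 12345)
GROUP BY region;

-- Pagination (keyset style): fetch the next 1000 rows after the last record_id seen.
SELECT   *
FROM     customers
WHERE    user_id IN (32343, 45676, 12345)
  AND    record_id > 0      -- replace 0 with the last record_id of the previous page
ORDER BY record_id
LIMIT    1000;
```

Paging on record_id rather than using a growing OFFSET means each page does not have to re-read the rows already delivered.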

"SELECT *" is not as bad as some people think; row-based databases will fetch the entire row if they fetch any of it, so in situations where you're not using a covering index, "SELECT *" is essentially no slower than "SELECT a,b,c" (NB: there is sometimes an exception when you have large BLOBs, but that is an edge case).

First things first: does your database fit in RAM? If not, get more RAM. No, seriously. Now, suppose your database is too huge to reasonably fit into RAM (say, > 32 GB); then you should try to reduce the number of random I/Os, as they are probably what's holding things up.

I'll assume from here on that you're running proper server-grade hardware with a RAID controller in RAID1 (or RAID10 etc.) and at least two spindles. If you're not, go away and get that.

You could definitely consider using a clustered index. In MySQL InnoDB you can only cluster the primary key, which means that if something else is currently the primary key, you'll have to change it. Composite primary keys are OK, and if you're doing a lot of queries on one criterion (say user_id), it is a definite benefit to make it the first part of the primary key (you'll need to add something else to make it unique).
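For the table in the question, that change might look like the sketch below. It rests on two assumptions the answer does not state: that no row currently has a NULL user_id (primary key columns must be NOT NULL), and that record_id may be AUTO_INCREMENT, which would require it to stay indexed. The key name `record_id_uq` is made up for the example:

```sql
-- Re-cluster customers so that all rows for one user_id sit next to each other.
-- This rebuilds the whole 7-million-row table, so try it on a copy first.
ALTER TABLE customers
    MODIFY user_id INT NOT NULL,
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (user_id, record_id),
    ADD UNIQUE KEY record_id_uq (record_id);  -- keeps record_id unique and indexed
```

The existing userid_idx index becomes redundant after this, since the new primary key already starts with user_id.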

Alternatively, you might be able to make your query use a covering index, in which case you don't need user_id to be the primary key (in fact, it must not be). This will only happen if all of the columns you need are in an index which begins with user_id.
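A small sketch of the idea, using only the columns shown in the question (the index name `userid_cover_idx` is hypothetical). The report in the question needs roughly 50 more columns, and MySQL allows at most 16 columns in one index, so this only applies to a narrower query:

```sql
-- Index starts with user_id and also carries the other columns the query reads.
CREATE INDEX userid_cover_idx ON customers (user_id, postcode, region);

-- The query must list only indexed columns; SELECT * would defeat the point.
-- InnoDB secondary indexes also store the primary key, so record_id is covered too.
EXPLAIN
SELECT record_id, user_id, postcode, region
FROM   customers
WHERE  user_id IN (32343, 45676, 12345);

-- If the index covers the query, the Extra column should include "Using index".
```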

As far as query efficiency is concerned, WHERE user_id IN (big list of IDs) is almost certainly the most efficient way of doing it from SQL.

BUT my biggest tips are:

  • Have a goal in mind, work out what it is, and when you reach it, stop.
  • Don't take anybody's word for it - try it and see.
  • Ensure that your performance test system is the same hardware spec as production.
  • Ensure that your performance test system has the same data size and kind as production (same schema is not good enough!).
  • Use synthetic data if it is not possible to use production data (copying production data may be logistically difficult, remember your database is > 32 GB; it may also violate security policies).
  • If your query is optimal (as it probably already is), try tuning the schema, then the database itself.

Are they the same ~560 ids every time? Or is it a different ~500 ids on different runs of the query?

You could just insert your 560 user IDs into a separate table (or even a temp table), stick an index on that table and inner join it to your original table.
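If the id sets are reused between runs, a permanent lookup table is one way to do this. The sketch below is an assumption-heavy illustration: the table `report_user_ids` and its `report_id` column are hypothetical, added so that different runs can keep different id sets side by side:

```sql
-- Permanent lookup table; the primary key doubles as the index on the ids.
CREATE TABLE report_user_ids (
    report_id INT NOT NULL,
    user_id   INT NOT NULL,
    PRIMARY KEY (report_id, user_id)
) ENGINE=InnoDB;

-- Load the id set for one report (shortened here to three ids).
INSERT INTO report_user_ids (report_id, user_id)
VALUES (1, 32343), (1, 45676), (1, 12345);

-- Inner join the id set to the original table.
SELECT     c.*
FROM       report_user_ids r
INNER JOIN customers c ON c.user_id = r.user_id
WHERE      r.report_id = 1;
```

If the ids really are the same every time, the INSERT only has to run once and the report query stays short.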

Is this your most important query? Is this a transactional table?

If so, try creating a clustered index on user_id. Your query might be slow because it still has to make random disk reads to retrieve the columns (key lookups), even after finding the records that match (index seek on the user_id index).

If you cannot change the clustered index, then you might want to consider an ETL process (the simplest is a trigger that inserts into another table with the best indexing). This should yield faster results.
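One way the trigger variant could look is sketched below. The table `customers_by_user` and the trigger name are hypothetical, only the columns named in the question are copied, and rows with a NULL user_id are skipped because they cannot be part of the new primary key:

```sql
-- Copy of customers, clustered by user_id for the report query.
CREATE TABLE customers_by_user (
    user_id   INT NOT NULL,
    record_id INT NOT NULL,
    postcode  VARCHAR(11) NULL,
    region    VARCHAR(30) NULL,
    PRIMARY KEY (user_id, record_id)
) ENGINE=InnoDB;

-- Mirror every newly inserted customers row into the copy.
CREATE TRIGGER customers_after_insert
AFTER INSERT ON customers
FOR EACH ROW
    INSERT INTO customers_by_user (user_id, record_id, postcode, region)
    SELECT NEW.user_id, NEW.record_id, NEW.postcode, NEW.region
    FROM   DUAL
    WHERE  NEW.user_id IS NOT NULL;
```

This covers inserts only. Existing rows would need a one-off backfill, and updates and deletes on customers would need their own triggers to keep the copy in step.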

Also note that such large queries may take some time to parse, so help it out by putting the queried ids into a temp table if possible.

You can try to insert the ids you need to query on into a temp table and inner join both tables. I don't know if that would help.
