MySQL: is it possible to make this query any faster?

I have a table "test" containing millions of entries. Each row contains a floating point "feature" and a "count" of how often this feature is present in item "id". The primary key for this table is the combination of "id" and "feature", i.e. every item may have multiple features. There are usually a couple of hundred to a couple of thousand feature entries per item id.

create table test
(
    id      int not null,
    feature double not null,
    count   int not null,
    primary key (id, feature)
);

The task is to find the 500 items most similar to a given reference item. Similarity is measured as the number of identical feature values in both items. The query I have come up with is quoted below, but despite properly using indices, its execution plan still contains "using temporary" and "using filesort", giving unacceptable performance for my use case.

select 
    t1.id,
    t2.id,
    sum( least( t1.count, t2.count )) as priority 
from test as t1
inner join test as t2 
     on t2.feature = t1.feature
where t1.id = {some user supplied id value} 
group by t1.id, t2.id 
order by priority desc
limit 500;

Any ideas on how to improve on this? The schema can be modified and indices added as needed.
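
For reference, the execution plan mentioned above can be inspected by prefixing the query with EXPLAIN. This is only a sketch: the id 42 is a placeholder for the user-supplied value, and the plan you get depends on your indexes and MySQL version.

```sql
-- 42 stands in for {some user supplied id value}.
EXPLAIN
SELECT
    t1.id,
    t2.id,
    SUM(LEAST(t1.count, t2.count)) AS priority
FROM test AS t1
INNER JOIN test AS t2
    ON t2.feature = t1.feature
WHERE t1.id = 42
GROUP BY t1.id, t2.id
ORDER BY priority DESC
LIMIT 500;
```

The "Using temporary" and "Using filesort" notes appear in the Extra column of the EXPLAIN output.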

With the current schema, this query can hardly be improved.

You already have an index on feature, and this is the best you can do with the current schema design.

The problem is that "more similar than" is not an order relation. If a is more similar to b than it is to c, it does not imply that c is less similar to a than it is to b. Hence, you cannot build a single index describing this relationship and need to do it for each item separately, which would make your index N^2 entries long, where N is the number of items.

If you always need only the top 500 items, you can limit your index to that figure (in which case it will hold 500 * N entries).

MySQL does not support indexed or materialized views, so you will have to do it yourself:

  1. Create a table like this:

     CREATE TABLE similarity (
         id1        INT NOT NULL,
         id2        INT NOT NULL,
         similarity DOUBLE NOT NULL,
         PRIMARY KEY (id1, id2),
         KEY (id1, similarity)
     );

  2. Whenever you insert a new feature into the table, reflect the changes in similarity:

     INSERT INTO similarity
     SELECT @newid, id, LEAST(@newcount, count)
     FROM test
     WHERE feature = @newfeature
       AND id <> @newid
     ON DUPLICATE KEY UPDATE similarity = similarity + VALUES(similarity);

  3. Periodically remove the excess similarities:

     DELETE s
     FROM (
         SELECT id1,
                (
                    SELECT similarity
                    FROM similarity si
                    WHERE si.id1 = s.id1
                    ORDER BY si.id1 DESC, si.similarity DESC
                    LIMIT 499, 1
                ) AS cs
         FROM (
             SELECT DISTINCT id1
             FROM similarity
         ) s
     ) q
     JOIN similarity s
         ON s.id1 = q.id1
        AND s.similarity < q.cs;

  4. Query your data:

     SELECT id2
     FROM similarity
     WHERE id1 = @myid
     ORDER BY similarity DESC
     LIMIT 500;
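
One way to automate step 2 is an AFTER INSERT trigger on test, sketched below. This is an assumption on my part rather than part of the recipe above: the trigger name is hypothetical, and the second INSERT (which also credits the reverse pair, so that older items see newer ones) goes beyond what step 2 does. VALUES() in ON DUPLICATE KEY UPDATE is deprecated as of MySQL 8.0.20 but still accepted.

```sql
-- DELIMITER is a directive of the mysql command-line client, not SQL itself.
DELIMITER //

-- trg_test_similarity is a hypothetical trigger name.
CREATE TRIGGER trg_test_similarity
AFTER INSERT ON test
FOR EACH ROW
BEGIN
    -- Credit the new row against every other item that shares its feature.
    INSERT INTO similarity (id1, id2, similarity)
    SELECT NEW.id, t.id, LEAST(NEW.count, t.count)
    FROM test AS t
    WHERE t.feature = NEW.feature
      AND t.id <> NEW.id
    ON DUPLICATE KEY UPDATE similarity = similarity + VALUES(similarity);

    -- Assumption: similarity should be stored in both directions.
    INSERT INTO similarity (id1, id2, similarity)
    SELECT t.id, NEW.id, LEAST(NEW.count, t.count)
    FROM test AS t
    WHERE t.feature = NEW.feature
      AND t.id <> NEW.id
    ON DUPLICATE KEY UPDATE similarity = similarity + VALUES(similarity);
END//

DELIMITER ;
```

The trigger only reads from test, so it should not run into MySQL's restriction on triggers modifying their own table. Updates and deletes on test are not handled by this sketch.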

Having a floating point number as part of a primary key (PK) is a killer. For that matter, it should not be part of any constraint: unique key (UK), foreign key (FK), etc.

To improve the performance of your SQL query many fold, try changing your schema as below:

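CREATE TABLE test (
    item_id    INTEGER,
    feature_id INTEGER,
    count      INTEGER
);

-- The referenced id columns need an index for the foreign keys below to be accepted.
CREATE TABLE features (
    id            INTEGER NOT NULL PRIMARY KEY,
    feature_value DOUBLE NOT NULL
);

CREATE TABLE items (
    id               INTEGER NOT NULL PRIMARY KEY,
    item_description VARCHAR(100) NOT NULL
);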

ALTER TABLE test ADD CONSTRAINT fk_test_item_id foreign key (item_id) references items(id);

ALTER TABLE test ADD CONSTRAINT fk_test_feature_id foreign key(feature_id) references features(id);

With your test table normalized as above, items and features are separated into their own tables, and test becomes a mapping table that also carries the count of each mapping.
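
If you already have data in the old layout, it could be moved over roughly as follows. This is a sketch under several assumptions that are not in the answer above: the original table has been renamed to test_old (a hypothetical name), MySQL 8.0 or later is available for ROW_NUMBER(), and an empty string is an acceptable placeholder description.

```sql
-- One surrogate id per distinct feature value.
INSERT INTO features (id, feature_value)
SELECT ROW_NUMBER() OVER (ORDER BY d.feature), d.feature
FROM (SELECT DISTINCT feature FROM test_old) AS d;

-- One row per item; the description is a placeholder.
INSERT INTO items (id, item_description)
SELECT DISTINCT id, ''
FROM test_old;

-- Rebuild the mapping table using the surrogate feature ids.
INSERT INTO test (item_id, feature_id, count)
SELECT o.id, f.id, o.count
FROM test_old AS o
INNER JOIN features AS f
    ON f.feature_value = o.feature;
```

The features and items tables are loaded first so that the foreign keys on test are satisfied.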

If you now run your earlier query with the small modifications shown below, you should see a significant improvement in its performance.

select t1.item_id, t2.item_id, sum( least( t1.count, t2.count )) as priority 
from test as t1 inner join test as t2 on t2.feature_id = t1.feature_id 
where t1.item_id = {some user supplied id value}
group by t1.item_id, t2.item_id 
order by priority desc
limit 500;

Cheers!

One optimization would be to exclude the item itself from the self-join:

inner join test as t2 
     on t2.feature = t1.feature and t2.id <> t1.id
                                    ^^^^^^^^^^^^^^

For further speedup, create a covering index on (feature, id, count).
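
As a sketch, such an index could be created like this (the index name is an assumption):

```sql
-- ix_test_feature_id_count is a hypothetical index name.
CREATE INDEX ix_test_feature_id_count ON test (feature, id, count);
```

With all three columns in the index, the t2 side of the self-join can be answered from the index alone, without reading the table rows.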

I would start with this, and would love to hear back on the performance you are seeing. I don't think you need the LEAST() of the t1 vs. t2 counts. If you first qualify the WHERE clause on ID = {some value}, you will obviously get all those "features". Then, via a self-join concerned only with the matching "features", you get a count. Since you are breaking it down by ID1 and ID2, each respective "feature" will be counted once. At the end of this query, since I'm not explicitly excluding t2.ID equal to {some user value}, its count should be the EXACT SAME count of features in t1, and anything below that would be your other closest matches.

I would ensure I had an index on ID and FEATURE.

select STRAIGHT_JOIN
      t1.id,
      t2.id, 
      count(*) as MatchedInBoth
   from 
      test as t1,
      test as t2
   where 
          t1.id = {some user value}
      and t1.feature = t2.feature
   group by
      t1.id,
      t2.id
   order by 
      MatchedInBoth desc 
   limit 
      500; 

The result might look something like this:

t1            t2           MatchedInBoth
{user value}  {user value} 275
{user value}  Other ID 1   270
{user value}  Other ID 2   241
{user value}  Other ID 3   218
{user value}  Other ID 4   197
{user value}  Other ID 5   163, etc

Can you knock it down to just one table? Using subqueries you might be able to avoid the join, and it will be a win if the subqueries are faster, indexed, and executed exactly once. Something like this (untested).

select
t2.id,
SUM( t2.count ) as priority
from test as t2
-- exclude the reference item itself; only other items are candidates
where t2.id <> {some user supplied id value} AND
t2.count > (SELECT MIN(count) FROM test t1 WHERE id= {some user supplied value} ) AND
t2.feature IN (SELECT feature FROM test t1 WHERE id= {some user supplied value} )
-- t1 is only visible inside the subqueries, so group on t2
group by t2.id
order by priority desc
limit 500;

If that doesn't work: MySQL is terrible at realizing that the inner selects are constant tables and will re-execute them for each row. Wrapping them in another select forces a constant table lookup. Here's a hack:


select
t2.id,
SUM( t2.count ) as priority
from test as t2
-- exclude the reference item itself; only other items are candidates
where t2.id <> {some user supplied id value} AND
t2.count > (
SELECT * FROM (
SELECT MIN(count) FROM test t1 WHERE id= {some user supplied value}
) as const ) AND
t2.feature IN ( SELECT * from (
SELECT feature FROM test t1 WHERE id= {some user supplied value}
) as const )
-- t1 is only visible inside the subqueries, so group on t2
group by t2.id
order by priority desc
limit 500;
