How can I optimize an SQL query to handle large data faster?

Question

Hi, I have a table that stores tags like this:

state: publish: 1 / unpublish: 0

id | name | releated_content_id | state
1     a           1                 1
2     a           2                 1
3     a           3                 1
4     a           4                 1
5     b           1                 1
6     b           2                 1
7     b           3                 1  
8     c           1                 1
.
.
.

Now I am trying to get the names of the 7 most repeated tags, along with their counts.

I use this query:

SELECT name, COUNT(name) count
    FROM Tags
    WHERE state = '1'
    GROUP BY name
    ORDER BY count DESC
    LIMIT 7

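It works fine, but it is too slow (more than 10 seconds to load) because I have a huge number of tags... about 1 million.

How can I optimize it?

Is there any way?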

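Edit:

@Allendar, @spencer7593, and @jlhonora

Thanks for your answers... they were very helpful to me... but I can't say which answer is best, because each has excellent notes and tests.

First, I indexed on state and then removed the clause... that helped a lot, but the average time became about 1 second.

That is too much for my page load time (my average page load time is under 1 second, and this has a bad effect on time to first byte).

In the end, I had to store the data in a file (via a cron job every hour) and then print the data from the file on every page load!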
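
The file-based workaround described above might look roughly like the following. This is only a minimal sketch, not the asker's actual code: the cache path, the JSON format, and the method names `write_top_tags` and `read_top_tags` are all assumptions made for illustration, and `rows` stands in for whatever the slow query returns.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical cache location; a real deployment would choose its own path.
CACHE_PATH = File.join(Dir.tmpdir, 'top_tags_cache.json')

# Meant to be called from the hourly cron job. `rows` stands in for the
# result of the slow GROUP BY query, e.g. [["a", 4], ["b", 3], ["c", 1]].
def write_top_tags(rows, path = CACHE_PATH)
  payload = rows.map { |name, count| { 'name' => name, 'count' => count } }
  tmp = "#{path}.tmp"
  File.write(tmp, JSON.generate(payload))
  # Write to a temp file and rename it, so a page load does not read a
  # half-written cache file.
  File.rename(tmp, path)
end

# Meant to be called on every page load: no database access, one file read.
def read_top_tags(path = CACHE_PATH)
  return [] unless File.exist?(path)
  JSON.parse(File.read(path))
end

write_top_tags([['a', 4], ['b', 3], ['c', 1]])
read_top_tags.each { |tag| puts "#{tag['name']}: #{tag['count']}" }
```

The trade-off is staleness: the page shows counts that are up to one hour old, in exchange for never running the aggregate query during a request.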

Thanks, everyone.

Answer 1

You can do the following: add an index on the name column.

Answer 2

Assuming you are using MySQL, create a composite index on state and name:

CREATE INDEX name_index ON Tags (state, name);

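Thanks to @Allendar and @spencer7593 for getting it right.

Edit: OK, I admit I may have jumped the gun on this one. So I wrote a script to test 4 cases:

1. No index
2. Index on name
3. Index on (state, name)
4. Index on state

TL;DR: the best one is option 3: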

Results for tags
       user     system      total        real
   0.000000   0.000000   0.000000 (  1.321065)
Results for tag_index_names
       user     system      total        real
   0.000000   0.000000   0.000000 (  0.490763)
Results for tag_index_composites
       user     system      total        real
   0.000000   0.000000   0.000000 (  0.151101)
Results for tag_index_states
       user     system      total        real
   0.000000   0.000000   0.000000 (  1.289544)

Here is the full Ruby / ActiveRecord script:

require 'active_record'
require 'mysql2'
require 'benchmark'
require 'securerandom' # needed for SecureRandom.hex below

db_name = 'test_db'
# Change the following to reflect your database settings
ActiveRecord::Base.establish_connection(
  adapter:  'mysql2', # or 'postgresql' or 'sqlite3'
  host:     'localhost',
  username: ENV['mysql_username'],
  database: db_name
)

ActiveRecord::Base.connection.execute("CREATE DATABASE IF NOT EXISTS #{db_name}")
ActiveRecord::Base.connection.execute("USE #{db_name}")

class Tag < ActiveRecord::Base

end

class TagIndexName < ActiveRecord::Base

end

class TagIndexComposite < ActiveRecord::Base

end

class TagIndexState < ActiveRecord::Base

end

# Define a minimal database schema
unless ActiveRecord::Base.connection.table_exists?(:tags)
  ActiveRecord::Base.connection.create_table :tags, force: true do |t|
    t.string  :name
    t.integer :state
  end
end

unless ActiveRecord::Base.connection.table_exists?(:tag_index_names)
  ActiveRecord::Base.connection.create_table :tag_index_names, force: true do |t|
    t.string  :name, index: true
    t.integer :state
  end
end

unless ActiveRecord::Base.connection.table_exists?(:tag_index_states)
  ActiveRecord::Base.connection.create_table :tag_index_states, force: true do |t|
    t.string  :name
    t.integer :state, index: true
  end
end

unless ActiveRecord::Base.connection.table_exists?(:tag_index_composites)
  ActiveRecord::Base.connection.create_table :tag_index_composites, force: true do |t|
    t.string  :name
    t.integer :state
    t.index  [:state, :name]
  end
end

table_names = [Tag.table_name, TagIndexName.table_name, TagIndexComposite.table_name, TagIndexState.table_name]

table_names.each do |table_name|
  ActiveRecord::Base.connection.execute("TRUNCATE TABLE #{table_name}")
end

puts "Creating items"
100_000.times do |i|
  name = SecureRandom.hex
  state = Random.rand(2)
  Tag.new(name: name, state: state).save!
  TagIndexName.new(name: name, state: state).save!
  TagIndexComposite.new(name: name, state: state).save!
  TagIndexState.new(name: name, state: state).save!
  if i > 0 && (i % 10000) == 0
    print i
  end
end
puts "Done creating items"

iterations = 1
table_names.each do |table_name|
  puts "Results for #{table_name}"
  Benchmark.bm do |bm|
    bm.report do
      iterations.times do
        ActiveRecord::Base.connection.execute("SELECT name, COUNT(name) count FROM #{table_name} WHERE state = 1 GROUP BY name ORDER BY count DESC LIMIT 7")
      end
    end
  end
end

Answer 3

For this particular query, the most appropriate index is a covering index.

  CREATE INDEX Tags_IX1 ON Tags (state, name)

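We would expect the EXPLAIN output for the query to show the index being used, with "Using index" in the Extra column, and avoiding an expensive "Using filesort" operation.

Because there is an equality predicate on state in the WHERE clause, followed by a GROUP BY operation on the name column, MySQL can satisfy the query from the index, without needing to perform a "sort" operation and without any lookups of pages in the underlying table.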
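
To illustrate why the column order (state, name) lets the grouping proceed without a sort, here is a small Ruby sketch that imitates an ordered index with a sorted array. It is only a model of the idea, not how MySQL is implemented; the sample rows and the method name `top_names` are made up for the example.

```ruby
# Each pair mimics one (state, name) index entry. The data is made up.
rows = [[1, 'b'], [0, 'a'], [1, 'a'], [1, 'c'],
        [1, 'a'], [0, 'c'], [1, 'b'], [1, 'a']]

# An index on (state, name) keeps its entries in this sorted order.
index_entries = rows.sort

def top_names(index_entries, state, limit)
  # Step 1: the equality predicate picks one contiguous slice of the index.
  # (A real B-tree would seek to the start of the slice; select is used
  # here only for brevity.)
  slice = index_entries.select { |s, _| s == state }

  # Step 2: inside that slice, equal names are already adjacent, so the
  # groups can be counted in a single pass with no sort and no table lookup.
  counts = slice.chunk_while { |a, b| a[1] == b[1] }
                .map { |run| [run.first[1], run.size] }

  # Step 3: only the much smaller per-name result is ordered by count.
  counts.sort_by { |_, count| -count }.first(limit)
end

p top_names(index_entries, 1, 7)
```

With name as the second column, every row for a given state and name sits next to its siblings, which is what makes the single-pass count in step 2 possible.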

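The suggestion (in other answers) to create an index on just the name column is insufficient for optimum performance of this particular query.

If we created an index like this: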

  ... ON Tags (name,state)

with name as the leading column, then we could rewrite the query to make more effective use of that index:

  SELECT t.name
       , SUM(IF(t.state='1',t.name IS NOT NULL,NULL)) AS count
    FROM Tags t
   GROUP BY t.name
   ORDER BY count DESC
   LIMIT 7
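
For readers unfamiliar with conditional aggregation, the following Ruby sketch mimics what the SUM(IF(...)) expression computes on a handful of made-up rows. It is an illustration of the semantics only; the method name `conditional_counts` and the data are assumptions for the example.

```ruby
# Made-up rows standing in for the Tags table.
rows = [
  { name: 'a', state: 1 }, { name: 'a', state: 1 }, { name: 'a', state: 0 },
  { name: 'b', state: 1 }, { name: 'c', state: 0 }
]

def conditional_counts(rows)
  rows.group_by { |r| r[:name] }.map do |name, group|
    # IF(state = 1, name IS NOT NULL, NULL):
    # 1 (or 0 for a NULL name) when the state matches, NULL otherwise.
    values = group.map { |r| r[:state] == 1 ? (r[:name].nil? ? 0 : 1) : nil }
    non_null = values.compact
    # SUM skips NULLs and returns NULL when nothing non-NULL is left.
    [name, non_null.empty? ? nil : non_null.sum]
  end
end

result = conditional_counts(rows)

# ORDER BY count DESC: MySQL places NULL counts last in descending order.
sorted = result.sort_by { |_, count| [count.nil? ? 1 : 0, -(count || 0)] }
p sorted
```

One difference from the original query is visible here: a name with no rows in state 1 (such as 'c') is still returned, with a NULL count, because there is no WHERE clause to filter it out before grouping.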

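Edit

Other answers here suggest adding an index on the state column. It seems likely that state has low cardinality. That is, the column has only a few distinct values, and a large percentage of the rows will have the value '1'. In that case, an index on just state is unlikely to provide optimal performance. That is because using that index (if MySQL even uses it) requires lookups to the underlying data pages to retrieve the name column, and then all of those rows need to be sorted to satisfy the GROUP BY.

Use EXPLAIN, Luke.

Reference: 8.8.1 Optimizing Queries with EXPLAIN https://dev.mysql.com/doc/refman/5.6/en/using-explain.html

Follow-up

@Allendar claimed (in a comment on this answer) that this answer is wrong. He said that the covering index I recommended "will not improve performance", and that an index on the single column state (as proposed in his answer) is the right answer. He also recommended testing.

So, here is a test.

SQL Fiddle here: http://sqlfiddle.com/#!9/20e73/2

(Be patient when opening that SQL Fiddle link... it populates a table with over a million rows, builds four indexes, and runs fifteen queries, so it spins for a dozen seconds or so.)

Here are the results from running against MySQL 5.6 on a local machine: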

run   no index     (state,name)  (name,state)  (state)      (name)
----  -----------  ------------  ------------  -----------  -----------
run1   2.410 sec    0.687 sec     1.076 sec     3.374 sec    3.924 sec
run2   2.433 sec    0.659 sec     1.074 sec     3.267 sec    3.958 sec
run3   2.851 sec    0.717 sec     1.024 sec     3.423 sec    4.222 sec

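The fastest is the multi-column index on (state,name).
The second fastest is the multi-column index on (name,state).
The third fastest is a full scan of the table.
In fourth place, slower than the table scan, is the index on (state).
Last is the index on the (name) column.

The results are similar when run on SQL Fiddle: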

         none     (s,n)    (n,s)   (n)     (s)
 ----    ------   ------   ------  ------  ------
 run1     701ms    193ms    286ms  1462ms   959ms
 run2     707ms    191ms    282ms  1170ms   957ms
 run3     702ms    190ms    283ms  1157ms   914ms

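The test results show that the multi-column index on (state,name) is the winner.

The test results also show that a full table scan is faster than using an index on just the state column. That is, we would get better performance by telling MySQL to ignore the index on just the state column.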

Answer 4

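Create an index on your state field. Here is why:

A BTREE INDEX on the state field serves search queries (a.k.a. the WHERE clause) on that field. What will happen now is that the BTREE will index your state values like this: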

1-> 11-> 11-> 112

2-> 21-> 22-> 221

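Now, say you have 100k rows with state ID 1 in your results. The query will go to the BTREE INDEX branch and start at 1. It does not need to go any deeper, because it has already found it. Under that branch, it immediately knows all the unique records it needs from the table, and it will look up the names by state very quickly.

For future reference: if you also do a WHERE on both name and state, then you need to create a combined INDEX on name and state, so that the BTREE builds a more complex INDEX combining the two, and those queries will improve as well.

Hope this helps.

Good luck!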

Answer 1
3 2015-06-25 17:42:36

Answer 2
3 2015-06-25 17:44:22

Answer 3
3 2015-06-25 17:57:09

Answer 4
1 Accepted