High-performance multi-tier tag filtering

Question

I have a large database of artists, albums, and tracks. Each of these items may have one or more tags assigned via glue tables (track_attribute_tags, album_attribute_tags, artist_attribute_tags). There are several thousand (or even a hundred thousand) tags applicable to each item type.

I am trying to accomplish two tasks, and I'm having a very hard time getting the queries to perform acceptably.

Task 1) Get all tracks that have any of the given track tags (if provided), by artists that have any of the given artist tags (if provided), on albums that have any of the given album tags (if provided). Any of the three tag sets may be absent (i.e. only a track tag is active, with no artist or album tags).

Variation: The results can also be presented by artist or by album rather than by track.

Task 2) Get a list of tags that are applied to the results from the previous filter, along with a count of how many tracks have each given tag.

What I am after is some general guidance on approach. I have tried temp tables, inner joins, and IN(); all my efforts thus far result in slow responses. A good example of the results I am after can be seen here: http://www.yachtworld.com/core/listing/advancedSearch.jsp, except that they have only one tier of tags and I am dealing with three.

Table structures:

Table: attribute_tag_groups
   Column   |          Type               |   
------------+-----------------------------+
 id         | integer                     |
 name       | character varying(255)      | 
 type       | enum (track, album, artist) | 

Table: attribute_tags
   Column                       |          Type               |   
--------------------------------+-----------------------------+
 id                             | integer                     |
 attribute_tag_group_id         | integer                     |
 name                           | character varying(255)      | 

Table: track_attribute_tags
   Column   |          Type               |   
------------+-----------------------------+
 track_id   | integer                     |
 tag_id     | integer                     | 

Table: artist_attribute_tags
   Column   |          Type               |   
------------+-----------------------------+
 artist_id  | integer                     |
 tag_id     | integer                     | 

Table: album_attribute_tags
   Column   |          Type               |   
------------+-----------------------------+
 album_id   | integer                     |
 tag_id     | integer                     | 

Table: artists
   Column   |          Type               |   
------------+-----------------------------+
 id         | integer                     |
 name       | varchar(350)                | 

Table: albums
   Column   |          Type               |   
------------+-----------------------------+
 id         | integer                     |
 artist_id  | integer                     | 
 name       | varchar(300)                | 

Table: tracks
   Column    |          Type               |   
-------------+-----------------------------+
 id          | integer                     |
 artist_id   | integer                     | 
 album_id    | integer                     | 
 compilation | boolean                     | 
 name        | varchar(300)                |

EDIT: I am using PHP, and I am not opposed to doing any sorting or other hijinks in script; my #1 concern is speed of return.

Answer 1

If you want speed, I would suggest you look into Solr/Lucene. You can store your data there and get very speedy lookups by calling Solr and parsing the result from PHP. As an added benefit you get faceted search as well (which is Task 2 of your question, if I interpret it correctly). The downside, of course, is that you might have redundant information (stored once in the DB and once in the Solr document store). It also takes a while to set up (though you could learn a lot from the Drupal Solr integration).

Just check out the PHP reference docs for Solr.

Here's an article on how to use Solr with PHP, just in case: http://www.ibm.com/developerworks/opensource/library/os-php-apachesolr/.
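To make the faceting idea concrete, here is a minimal sketch of how the request for Task 1 and Task 2 might be assembled. It is written in Python rather than PHP so it can be run standalone, and it only builds the URL; it does not call a server. The core name `tracks` and the multi-valued fields `track_tags`, `album_tags` and `artist_tags` are assumptions for illustration, not something the question defines. The parameters `q`, `fq`, `rows`, `wt`, `facet`, `facet.field` and `facet.mincount` are standard Solr select parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical Solr core and field names; adjust to your own schema.
SOLR_SELECT_URL = "http://localhost:8983/solr/tracks/select"


def build_facet_query(track_tags=(), album_tags=(), artist_tags=()):
    """Build a select URL that filters by the optional tag sets and asks
    Solr for facet counts on each tag field."""
    params = [
        ("q", "*:*"),            # match everything, then narrow with filters
        ("rows", "20"),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.mincount", "1"),  # hide tags that no result carries
    ]
    tiers = (
        ("track_tags", track_tags),
        ("album_tags", album_tags),
        ("artist_tags", artist_tags),
    )
    for field, tags in tiers:
        params.append(("facet.field", field))
        if tags:
            # "Any of the given tags" becomes an OR group on that field.
            # Tag values containing quotes would need escaping first.
            joined = " OR ".join('"%s"' % t for t in tags)
            params.append(("fq", "%s:(%s)" % (field, joined)))
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_facet_query(track_tags=["live", "remix"], artist_tags=["jazz"])
print(url)
print(parse_qs(urlsplit(url).query)["fq"])
```

A tier with no active tags simply contributes no `fq` filter, which matches the "if provided" wording of Task 1.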

Answer 2

You should probably try to denormalize your data. Your structure is optimised for insert/update load, but not for queries. As I understand it, you will have many more select queries than insert/update queries.

For example, you can do something like this:

Store your data in the normalized structure.

Create an aggregate table like this:

    track_id, artist_tags, album_tags, track_tags
    1,        jazz/pop/,   jazz/rock,  /heavy-metal/

or

    track_id, artist_tags, album_tags, track_tags
    1,        1/2/,        1/3,        4/

To speed up search, you should probably create a FULLTEXT index on the *_tags columns.

Query this table with SQL like:

-- MATCH takes the indexed column(s); album_tags is used here as an example
SELECT * FROM aggregate WHERE MATCH (album_tags) AGAINST ('rock');

Rebuild this table incrementally once a day.
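As a rough illustration of the aggregate table, the sketch below builds it from the normalized glue tables and queries it. It uses Python with an in-memory SQLite database as a stand-in for MySQL, so a few things are assumptions: the sample rows are invented, each tag list is wrapped in `/` on both sides (for example `/1/3/`), and a `LIKE` pattern stands in for the FULLTEXT lookup the answer recommends. In MySQL the same table could be filled with `GROUP_CONCAT` and queried with `MATCH ... AGAINST`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tracks (id INTEGER PRIMARY KEY, artist_id INTEGER,
                     album_id INTEGER, name TEXT);
CREATE TABLE track_attribute_tags  (track_id  INTEGER, tag_id INTEGER);
CREATE TABLE album_attribute_tags  (album_id  INTEGER, tag_id INTEGER);
CREATE TABLE artist_attribute_tags (artist_id INTEGER, tag_id INTEGER);

-- Invented sample data: two tracks by one artist on two albums.
INSERT INTO tracks VALUES (1, 10, 100, 'Track A'), (2, 10, 101, 'Track B');
INSERT INTO track_attribute_tags  VALUES (1, 4), (2, 14);
INSERT INTO album_attribute_tags  VALUES (100, 1), (100, 3), (101, 13);
INSERT INTO artist_attribute_tags VALUES (10, 1), (10, 2);

-- One row per track; each *_tags column is a '/'-wrapped id list.
CREATE TABLE aggregate AS
SELECT t.id AS track_id,
       (SELECT '/' || group_concat(tag_id, '/') || '/'
          FROM artist_attribute_tags WHERE artist_id = t.artist_id) AS artist_tags,
       (SELECT '/' || group_concat(tag_id, '/') || '/'
          FROM album_attribute_tags WHERE album_id = t.album_id) AS album_tags,
       (SELECT '/' || group_concat(tag_id, '/') || '/'
          FROM track_attribute_tags WHERE track_id = t.id) AS track_tags
FROM tracks t;
""")


def tracks_with_album_tag(tag_id):
    """Return ids of tracks whose album carries the given tag id."""
    # The surrounding slashes keep tag 3 from matching tag 13.
    pattern = "%/{}/%".format(int(tag_id))
    rows = conn.execute(
        "SELECT track_id FROM aggregate WHERE album_tags LIKE ? "
        "ORDER BY track_id",
        (pattern,),
    )
    return [row[0] for row in rows]


print(tracks_with_album_tag(3))
print(tracks_with_album_tag(13))
```

Wrapping every id in delimiters is what makes substring-style matching safe here. A leading-wildcard `LIKE` cannot use an ordinary index, which is presumably why the answer suggests a FULLTEXT index instead.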

Answer 3

I think the answer greatly depends on how much money you wish to spend on your project. Some tasks are even theoretically impossible to accomplish under strict conditions (for example, that you must use only one weak server). I will assume that you are ready to upgrade your system.

First of all, your table structure forces JOINs, and I think you should avoid them if possible when writing high-performance applications. I don't know what "attribute_tag_groups" is, so I propose a table structure: tag (varchar 255), id (int), id_type (enum (track, album, artist)). The id can be an artist_id, track_id or album_id depending on id_type. This way you will be able to look up all your data in one table, but of course it will use much more memory.
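A minimal sketch of that single lookup table, using Python with in-memory SQLite in place of MySQL. The table name `tags`, the composite primary key, the helper `ids_for` and the sample rows are all assumptions added for illustration; only the three columns come from the proposal above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (
    tag     TEXT    NOT NULL,
    id      INTEGER NOT NULL,  -- artist_id, track_id or album_id
    id_type TEXT    NOT NULL CHECK (id_type IN ('track', 'album', 'artist')),
    -- Assumed key order: lets one index serve "ids of this type with this tag".
    PRIMARY KEY (id_type, tag, id)
);
INSERT INTO tags VALUES
    ('jazz', 10, 'artist'), ('pop', 10, 'artist'),
    ('rock', 100, 'album'),
    ('live', 1, 'track'), ('live', 2, 'track'), ('remix', 2, 'track');
""")


def ids_for(id_type, tag_names):
    """Return ids of the given type that carry any of the given tags."""
    if not tag_names:
        return []
    placeholders = ",".join("?" * len(tag_names))
    sql = (
        "SELECT DISTINCT id FROM tags "
        "WHERE id_type = ? AND tag IN (%s) ORDER BY id" % placeholders
    )
    return [row[0] for row in conn.execute(sql, [id_type, *tag_names])]


print(ids_for("track", ["live", "remix"]))
print(ids_for("artist", ["jazz"]))
```

Each tier becomes one join-free lookup; combining the three id lists into a final track list still has to happen somewhere, either in SQL or in the application.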

Next, you should consider using several databases. It will help even more if each database contains only part of your data (each lookup will be faster). Deciding how to spread your data between databases is usually a rather hard task: I suggest you gather some statistics about tag length, find ranges of length that yield similar track/artist result counts, and hard-code them into your lookup code.

Of course you should consider MySQL tuning (I am sure you did that, but just in case): all your tables should reside in RAM, and if that is impossible, try to get SSD disks, RAID, etc. Proper indexing and database types/settings are really important too (MySQL may even show some bottlenecks in its internal statistics).

This suggestion may sound mad, but sometimes it is good to let PHP do some calculations that MySQL could do itself. MySQL databases are much harder to scale, while a server for PHP processing can be added in a matter of minutes. Different PHP threads can also run on different CPU cores, which MySQL has problems with. You can increase your PHP performance by using some advanced modules (you can even write them yourself: profile your PHP scripts and hard-code the bottlenecks in fast C code).
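One way to read this for the tag filter: fetch a plain id set per tier with simple single-table queries, then combine them in application code. The sketch below shows only the combining step, in Python rather than PHP so it runs standalone. The function name, the tuple layout and the sample data are assumptions, and whether this beats a single SQL query depends on how large the id sets are.

```python
def filter_tracks(tracks, track_ids=None, album_ids=None, artist_ids=None):
    """Keep the tracks that pass every active tier.

    tracks: iterable of (track_id, artist_id, album_id) tuples.
    Each *_ids argument is a set of ids matching that tier's tags,
    or None when the tier has no active tags and must not filter.
    """
    result = []
    for track_id, artist_id, album_id in tracks:
        if track_ids is not None and track_id not in track_ids:
            continue
        if album_ids is not None and album_id not in album_ids:
            continue
        if artist_ids is not None and artist_id not in artist_ids:
            continue
        result.append(track_id)
    return result


# Invented sample data: (track_id, artist_id, album_id)
tracks = [(1, 10, 100), (2, 10, 101), (3, 11, 102)]

print(filter_tracks(tracks, artist_ids={10}))
print(filter_tracks(tracks, artist_ids={10}, album_ids={101}))
print(filter_tracks(tracks))
```

Set membership tests are constant time on average, so the pass is linear in the number of candidate tracks.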

Last, but I think most important: you must use some type of caching. I know that it is really hard, but I don't think there has been any big project without a really good caching system. In your case some tags will surely be much more popular than others, so it should greatly increase performance. Caching is a form of art: depending on how much time you can spend on it and how many resources are available, you can make 99% of all requests use the cache.
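A minimal sketch of the idea, assuming results are cached per combination of active tags. The important detail is normalizing the key so that the same tags in a different order hit the same entry. The class and function names are hypothetical, the in-process dict stands in for whatever shared cache a real deployment would use, and invalidation here is simply "clear everything" (for example after the tag tables change).

```python
def cache_key(track_tags=(), album_tags=(), artist_tags=()):
    """Order-independent, duplicate-free key for one filter combination."""
    return (
        tuple(sorted(set(track_tags))),
        tuple(sorted(set(album_tags))),
        tuple(sorted(set(artist_tags))),
    )


class TagQueryCache:
    def __init__(self, run_query):
        self._run_query = run_query  # the expensive database call
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, track_tags=(), album_tags=(), artist_tags=()):
        key = cache_key(track_tags, album_tags, artist_tags)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._run_query(*key)
        return self._store[key]

    def invalidate(self):
        """Drop everything, e.g. after tags were added or removed."""
        self._store.clear()


calls = []


def slow_query(track_tags, album_tags, artist_tags):
    # Placeholder for the real database query.
    calls.append((track_tags, album_tags, artist_tags))
    return ["result for", track_tags, album_tags, artist_tags]


cache = TagQueryCache(slow_query)
first = cache.get(track_tags=[3, 1])
second = cache.get(track_tags=[1, 3, 3])  # same filter, different spelling
print(cache.hits, cache.misses, len(calls))
```

Because a few popular tag combinations tend to dominate traffic, even a small cache like this can absorb a large share of requests.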

Using other databases/indexing tools may help you, but you should always consider a theoretical query speed comparison (O(n), O(n log n), ...) to understand whether they can really help you. Using these tools sometimes gives you only a small performance gain (like a constant 20%), but they may complicate your application design, and most of the time that is not worth it.

Answer 4

From my experience, most 'slow' MySQL databases don't have correct indexes and/or queries. So I would check these first:

Make sure the id field of every data table is its primary key. Just in case.
For all data tables, create an index on the external id fields (the foreign keys) followed by the id, so that MySQL can use it in searches.
For your glue tables, set a primary key on the two fields, first the subject, then the tag. This is for normal browsing. Then create a normal index on the tag id. This is for searching.
Still slow? Are you using MyISAM for your tables? It is designed for quick queries.
If still slow, run an EXPLAIN on a slow query and post both the query and the result in the question. Preferably with an importable SQL dump of your complete database structure.
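The checklist above can be sketched as follows, together with one possible shape for the Task 1 and Task 2 queries. This uses Python with in-memory SQLite as a stand-in for MySQL; the index names, helper functions and sample rows are assumptions, and the `EXISTS` formulation is just one option, not something this answer prescribes. Each tier adds a filter only when tags are supplied for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tracks (id INTEGER PRIMARY KEY, artist_id INTEGER,
                     album_id INTEGER, name TEXT);
-- External id first, then the id.
CREATE INDEX tracks_by_artist ON tracks (artist_id, id);
CREATE INDEX tracks_by_album  ON tracks (album_id, id);

-- Glue tables: primary key (subject, tag) plus a plain index on the tag.
CREATE TABLE track_attribute_tags (track_id INTEGER, tag_id INTEGER,
                                   PRIMARY KEY (track_id, tag_id));
CREATE INDEX track_tags_by_tag ON track_attribute_tags (tag_id);
CREATE TABLE album_attribute_tags (album_id INTEGER, tag_id INTEGER,
                                   PRIMARY KEY (album_id, tag_id));
CREATE INDEX album_tags_by_tag ON album_attribute_tags (tag_id);
CREATE TABLE artist_attribute_tags (artist_id INTEGER, tag_id INTEGER,
                                    PRIMARY KEY (artist_id, tag_id));
CREATE INDEX artist_tags_by_tag ON artist_attribute_tags (tag_id);

-- Invented sample data.
INSERT INTO tracks VALUES (1, 10, 100, 'A'), (2, 10, 101, 'B'), (3, 11, 102, 'C');
INSERT INTO track_attribute_tags  VALUES (1, 4), (2, 4), (2, 5), (3, 5);
INSERT INTO album_attribute_tags  VALUES (100, 1), (101, 3), (102, 3);
INSERT INTO artist_attribute_tags VALUES (10, 2), (11, 7);
""")

# (glue table, its subject column, matching column on tracks)
TIERS = {
    "track_tags": ("track_attribute_tags", "track_id", "id"),
    "album_tags": ("album_attribute_tags", "album_id", "album_id"),
    "artist_tags": ("artist_attribute_tags", "artist_id", "artist_id"),
}


def matching_tracks_sql(**filters):
    """Build the Task 1 query; tiers without tags add no condition."""
    sql, params = "SELECT t.id FROM tracks t WHERE 1=1", []
    for name, (table, key_col, track_col) in TIERS.items():
        tag_ids = list(filters.get(name) or [])
        if tag_ids:
            marks = ",".join("?" * len(tag_ids))
            sql += (" AND EXISTS (SELECT 1 FROM {t} x WHERE x.{k} = t.{c}"
                    " AND x.tag_id IN ({m}))").format(
                        t=table, k=key_col, c=track_col, m=marks)
            params += tag_ids
    return sql, params


def find_tracks(**filters):
    """Task 1: ids of tracks that pass every active tier."""
    sql, params = matching_tracks_sql(**filters)
    return [row[0] for row in conn.execute(sql + " ORDER BY t.id", params)]


def track_tag_counts(**filters):
    """Task 2: for each track tag, how many matching tracks carry it."""
    sql, params = matching_tracks_sql(**filters)
    rows = conn.execute(
        "SELECT tt.tag_id, COUNT(*) FROM track_attribute_tags tt "
        "WHERE tt.track_id IN (" + sql + ") "
        "GROUP BY tt.tag_id ORDER BY tt.tag_id", params)
    return dict(rows.fetchall())


print(find_tracks(artist_tags=[2], album_tags=[3]))
print(track_tag_counts(artist_tags=[2]))
```

Running EXPLAIN on the generated SQL in MySQL would then show whether the tag-id indexes are actually being used for each tier.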

Answer 5

Things you may try:

Use a query analyzer to explore the bottlenecks of your queries. (Most of the time the underlying DBS does an amazing job of optimizing.)
Your table structure is well normalized, but personal experience has shown me that you can achieve much greater performance with structures that let you avoid joins and subqueries. For your case I would suggest storing the tag information in one field. (This requires support by the underlying DBS.)

So far.

Answer 6

Check your indices, and whether they are used correctly. Maybe MySQL isn't up to the task. PostgreSQL should be similar to use but has better performance in complex situations.

On a completely different track, google map-reduce and use one of these fancy new NoSQL databases for really, really large data sets. These can do distributed search on multiple servers in parallel.

6 answers

Answer 1: score 3, 2011-08-05 18:30:02

Answer 2: score 2, accepted, 2011-08-08 14:49:29

Answer 3: score 2, 2011-08-10 08:59:20

Answer 4: score 1, 2011-08-14 12:34:35

Answer 5: score 0, 2011-08-08 14:54:01

Answer 6: score 0, 2011-08-15 09:57:31
