Efficient implementation of faceted search in relational databases

I am trying to implement faceted search, or tagging with multiple-tag filtering. In faceted navigation, only non-empty categories are displayed, and the number of items in each category that also match the already applied criteria is shown in parentheses.

I can get all items that have the assigned categories using INNER JOINs, and get the number of items in each category using COUNT and GROUP BY. However, I'm not sure how this will scale to millions of objects and thousands of tags, especially the counting.

I know there are non-relational solutions like Lucene + Solr, but I've also found some closed-source RDBMS-based implementations that are said to be enterprise-strength, like FacetMap.com or Endeca software, so there must be an efficient way to perform faceted search in relational databases.

Does anybody have experience with faceted search and could give some tips?

Cache the counts for each category set? Maybe use some smart incremental technique that updates the counters?
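To make the incremental-counter idea concrete, here is a minimal sketch using SQLite triggers from Python. The `tag_counts` table and the trigger names are hypothetical, not part of any existing schema. Note that it only caches the unfiltered per-tag counts; counts under an already applied filter depend on the filter combination, which is the expensive part.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items_tags (
        item_id INTEGER NOT NULL,
        tag_id  INTEGER NOT NULL,
        PRIMARY KEY (item_id, tag_id)
    );

    -- Hypothetical cache table: one row per tag that has at least one item.
    CREATE TABLE tag_counts (
        tag_id INTEGER PRIMARY KEY,
        n      INTEGER NOT NULL
    );

    -- On insert: create the counter row if it is missing, then increment it.
    CREATE TRIGGER items_tags_after_insert AFTER INSERT ON items_tags
    BEGIN
        INSERT INTO tag_counts (tag_id, n)
        SELECT NEW.tag_id, 0
        WHERE NOT EXISTS (SELECT 1 FROM tag_counts WHERE tag_id = NEW.tag_id);
        UPDATE tag_counts SET n = n + 1 WHERE tag_id = NEW.tag_id;
    END;

    -- On delete: decrement, and drop the row once the tag becomes empty.
    CREATE TRIGGER items_tags_after_delete AFTER DELETE ON items_tags
    BEGIN
        UPDATE tag_counts SET n = n - 1 WHERE tag_id = OLD.tag_id;
        DELETE FROM tag_counts WHERE tag_id = OLD.tag_id AND n <= 0;
    END;
""")

conn.executemany(
    "INSERT INTO items_tags VALUES (?, ?)",
    [(1, 10), (2, 10), (2, 20)],
)
conn.execute("DELETE FROM items_tags WHERE item_id = 2 AND tag_id = 20")

# Reading a count is now a primary-key lookup instead of a COUNT(*) scan.
counts = dict(conn.execute("SELECT tag_id, n FROM tag_counts"))
print(counts)
```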

Edit:

An example of faceted navigation can be found here: Flamenco.

Currently I have the standard 3-table schema (items, tags and items_tags, as described here: http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html#toxi ) plus a table for facets. Each tag is assigned to a facet.
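For reference, a sketch of that schema and the JOIN plus GROUP BY counting query, run against SQLite from Python. The column names, the sample data, and the `facet_counts` helper are assumptions made for illustration; only the table names come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE facets (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE tags (
        id       INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        facet_id INTEGER NOT NULL REFERENCES facets(id)
    );
    CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE items_tags (
        item_id INTEGER NOT NULL REFERENCES items(id),
        tag_id  INTEGER NOT NULL REFERENCES tags(id),
        PRIMARY KEY (item_id, tag_id)
    );
    CREATE INDEX items_tags_by_tag ON items_tags (tag_id, item_id);

    INSERT INTO facets VALUES (1, 'color'), (2, 'size');
    INSERT INTO tags VALUES
        (1, 'red', 1), (2, 'blue', 1), (3, 'small', 2), (4, 'large', 2);
    INSERT INTO items VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO items_tags VALUES (1, 1), (1, 3), (2, 1), (2, 4), (3, 2), (3, 3);
""")


def facet_counts(conn, selected_tag_ids):
    """Count items per tag among the items that carry every selected tag."""
    selected = sorted(set(selected_tag_ids))
    if selected:
        placeholders = ",".join("?" * len(selected))
        # Items matching ALL selected tags: group and compare the tag count.
        matching = f"""
            SELECT item_id FROM items_tags
            WHERE tag_id IN ({placeholders})
            GROUP BY item_id
            HAVING COUNT(DISTINCT tag_id) = ?
        """
        params = [*selected, len(selected)]
    else:
        matching = "SELECT id AS item_id FROM items"
        params = []
    # INNER JOINs drop tags with no matching items, so empty categories
    # never appear in the result.
    sql = f"""
        SELECT f.name, t.name, COUNT(*) AS n
        FROM ({matching}) AS m
        INNER JOIN items_tags AS it ON it.item_id = m.item_id
        INNER JOIN tags AS t ON t.id = it.tag_id
        INNER JOIN facets AS f ON f.id = t.facet_id
        GROUP BY f.name, t.name
        ORDER BY f.name, n DESC, t.name
    """
    return conn.execute(sql, params).fetchall()


for row in facet_counts(conn, [1]):   # filter: tag 'red'
    print(row)
```

This is the query whose scaling behavior I'm unsure about: every request re-aggregates items_tags for the whole matching set.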

IMO, relational databases aren't that good at searching. You would get better performance from a dedicated search engine (like Solr/Lucene).

I can only confirm what Nils says. RDBMSs are not good at multi-dimensional searching. I have worked with some smart solutions: caching counters, using triggers, and so on. But in the end, an external dedicated indexer always wins.

Maybe, if you transform your data into a dimensional model and feed it to some OLAP engine (I mean an MDX engine), it will perform well. But that seems too heavyweight a solution, and it will definitely NOT be real-time.

In contrast, a solution with a dedicated indexing engine (think Lucene, think Sphinx) can be made near-real-time with incremental index updates.
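To illustrate what "incremental index updates" means, here is a toy in-memory inverted index in Python. It is only a sketch of the idea, not how Lucene or Sphinx are actually implemented; the class and method names are made up.

```python
from collections import defaultdict


class TinyFacetIndex:
    """Toy inverted index: tag -> set of item ids."""

    def __init__(self):
        self.postings = defaultdict(set)  # tag -> item ids carrying it
        self.item_tags = {}               # item id -> frozenset of tags

    def upsert(self, item_id, tags):
        """Add or replace one item; only its own postings are touched."""
        tags = frozenset(tags)
        self.remove(item_id)
        self.item_tags[item_id] = tags
        for tag in tags:
            self.postings[tag].add(item_id)

    def remove(self, item_id):
        for tag in self.item_tags.pop(item_id, ()):
            ids = self.postings[tag]
            ids.discard(item_id)
            if not ids:
                del self.postings[tag]

    def facet_counts(self, selected=()):
        """Counts per tag among items that carry every selected tag."""
        if selected:
            sets = [self.postings.get(tag, set()) for tag in selected]
            matching = set.intersection(*sets)
        else:
            matching = set(self.item_tags)
        counts = {}
        for tag, ids in self.postings.items():
            n = len(ids & matching)
            if n:  # hide empty categories
                counts[tag] = n
        return counts


index = TinyFacetIndex()
index.upsert(1, {"red", "small"})
index.upsert(2, {"red", "large"})
index.upsert(3, {"blue", "small"})
print(index.facet_counts(["red"]))

index.upsert(2, {"blue"})  # incremental update of a single item
print(index.facet_counts(["red"]))
```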

Faceted search is an analytic problem, which means dimensional design is a good bet. In other words, the thing you search against must be in tabular form.

Include all columns of interest in your analytic table.

Put continuous values into buckets.

Use boolean columns for "many" items like categories or tags. For example, if there are three tags "foo", "bar", and "baz", you would have three boolean columns.

Use a materialized view to create your analytic table.

Index the crap out of it. Some databases support indexes for this type of application.

Only filter once.

Union your results.

Build pre-aggregated materialized views for common queries.
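As a rough sketch of the first few steps, the snippet below builds such an analytic table in SQLite from Python. SQLite has no materialized views, so a plain `CREATE TABLE ... AS SELECT` stands in for one here. The source tables `cars` and `cars_tags`, the `price` column, and the bucket boundaries are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical normalized source tables.
    CREATE TABLE cars (id INTEGER PRIMARY KEY, brand TEXT, price INTEGER);
    CREATE TABLE cars_tags (car_id INTEGER, tag TEXT);

    INSERT INTO cars VALUES
        (1, 'Audi', 9000), (2, 'Audi', 31000), (3, 'Fiat', 12000);
    INSERT INTO cars_tags VALUES
        (1, 'foo'), (1, 'bar'), (2, 'foo'), (3, 'baz');

    -- Stand-in for a materialized view: rebuild this table to refresh it.
    CREATE TABLE cars_analytic AS
    SELECT
        c.id,
        c.brand,
        -- Continuous value bucketed into a small set of labels.
        CASE
            WHEN c.price < 10000 THEN 'under 10k'
            WHEN c.price < 30000 THEN '10k-30k'
            ELSE '30k and up'
        END AS price_bucket,
        -- One boolean (0/1) column per tag; cars without tags get 0.
        COALESCE(MAX(t.tag = 'foo'), 0) AS foo,
        COALESCE(MAX(t.tag = 'bar'), 0) AS bar,
        COALESCE(MAX(t.tag = 'baz'), 0) AS baz
    FROM cars AS c
    LEFT JOIN cars_tags AS t ON t.car_id = c.id
    GROUP BY c.id, c.brand, c.price;

    CREATE INDEX cars_analytic_brand ON cars_analytic (brand);
    CREATE INDEX cars_analytic_price_bucket ON cars_analytic (price_bucket);
""")

analytic_rows = conn.execute(
    "SELECT id, brand, price_bucket, foo, bar, baz FROM cars_analytic ORDER BY id"
).fetchall()
for row in analytic_rows:
    print(row)
```

In a database with real materialized views, the same SELECT would be the view definition and a refresh would replace the manual rebuild.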

This article might help you too: https://blog.jooq.org/2017/04/20/how-to-calculate-multiple-aggregate-functions-in-a-single-query/

with filtered as (
    select
    *
    from cars_analytic
    where
        [some search conditions]
)

--for each facet:

select
    'brand' as facet,
    brand as value,
    count(*) as count
from
    filtered
group by
    brand

union

select
    'cool-tag' as facet,
    'cool-tag' as value,
    count(*) as count
from
    filtered
where
    cool_tag

union

...


-- sort at the end
order by
    facet,
    count desc,
    value

100,000 records with 5 facets in ~150 ms
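For anyone who wants to try the pattern end to end, here is a small runnable variant on SQLite from Python. The sample rows, the `price_bucket` column, and the concrete filter are assumptions filling in the placeholder search conditions. It also swaps `union` for `UNION ALL`, since each branch produces distinct rows and de-duplication is not needed, and groups the tag branch so that an empty tag yields no row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cars_analytic (
        id INTEGER PRIMARY KEY,
        brand TEXT,
        price_bucket TEXT,   -- assumed bucketed column
        cool_tag INTEGER     -- boolean stored as 0/1
    );
    INSERT INTO cars_analytic VALUES
        (1, 'Audi', 'under 10k', 1),
        (2, 'Audi', '30k and up', 0),
        (3, 'Fiat', 'under 10k', 1),
        (4, 'Fiat', '10k-30k', 0),
        (5, 'Audi', 'under 10k', 0);
""")

FACET_SQL = """
    WITH filtered AS (          -- filter once, reuse for every facet
        SELECT * FROM cars_analytic WHERE price_bucket = ?
    )
    SELECT 'brand' AS facet, brand AS value, COUNT(*) AS n
    FROM filtered
    GROUP BY brand

    UNION ALL

    SELECT 'cool-tag', 'cool-tag', COUNT(*)
    FROM filtered
    WHERE cool_tag
    GROUP BY cool_tag           -- no row at all when nothing matches

    ORDER BY facet, n DESC, value
"""

facet_rows = conn.execute(FACET_SQL, ("under 10k",)).fetchall()
for row in facet_rows:
    print(row)
```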

Regarding the counts, why pull them via SQL? You'll have to iterate through the result set in your code anyway, so why not do the counting there?

I'm currently using this approach in a faceted search app I'm developing, and it's working fine. The only tricky part is setting up your code to not output a facet until it reaches a new one. At that point, output the facet and the number of rows you found for it.

This approach assumes you're pulling back a list of all matching items, and thus multiple rows with the same facet. When you order the result by facet, it's easy to get the counts in your code instead.
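A minimal Python sketch of that loop, assuming the rows arrive as `(facet, item_id)` tuples already sorted by facet (the row shape and the function name are made up for illustration). `itertools.groupby` takes care of the "wait until the facet changes" bookkeeping:

```python
from itertools import groupby
from operator import itemgetter


def count_facets(rows):
    """Yield (facet, count) from rows of (facet, item_id) sorted by facet.

    groupby starts a new group whenever the facet changes, so each facet
    is emitted exactly once, after all of its rows have been seen.
    """
    for facet, facet_rows in groupby(rows, key=itemgetter(0)):
        yield facet, sum(1 for _ in facet_rows)


rows = [
    ("brand:Audi", 1),
    ("brand:Audi", 2),
    ("brand:Fiat", 3),
    ("cool-tag", 1),
    ("cool-tag", 3),
]
for facet, count in count_facets(rows):
    print(f"{facet} ({count})")
```

The trade-off is that every matching row has to travel from the database to the application, which may matter once the matching set gets large.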
