
How to efficiently set subtract a join table in PostgreSQL?

I have the following tables:

  • work_units - self explanatory
  • workers - self explanatory
  • skills - every work unit requires a number of skills if you want to work on it. Every worker is proficient in a number of skills.
  • work_units_skills - join table
  • workers_skills - join table

A worker can request the next appropriate free highest priority (whatever that means) unit of work to be assigned to her.


Currently I have:

SELECT work_units.*
FROM work_units
-- some joins
WHERE NOT EXISTS (
        SELECT skill_id
        FROM work_units_skills
        WHERE work_unit_id = work_units.id

        EXCEPT

        SELECT skill_id
        FROM workers_skills
        WHERE worker_id = 1 -- the worker id that made the request
      )
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
FOR UPDATE SKIP LOCKED;

This condition makes the query 8-10 times slower though.

Is there a better way to express that a work_unit's skills should be a subset of the worker's skills, or something else to improve the current query?


Some more context:

  • The skills table is fairly small.
  • Both work_units and workers tend to have very few associated skills.
  • work_units_skills has an index on work_unit_id.
  • I tried moving the query on workers_skills into a CTE. This gave a slight improvement (10-15%), but it's still too slow.
  • A work unit with no skill can be picked up by any user. In other words, the empty set is a subset of every set.

One simple speed-up would be to use EXCEPT ALL instead of EXCEPT. The latter removes duplicates, which is unnecessary here and can be slow.

An alternative that would probably be faster is to use a further NOT EXISTS instead of the EXCEPT:

...
WHERE NOT EXISTS (
        SELECT skill_id
        FROM work_units_skills wus
        WHERE work_unit_id = work_units.id
        AND NOT EXISTS (
            SELECT skill_id
            FROM workers_skills ws
            WHERE worker_id = 1 -- the worker id that made the request
              AND ws.skill_id = wus.skill_id
        )
      )
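The double NOT EXISTS reads as "there is no required skill that the worker lacks". A minimal sketch of that logic follows, run against Python's built-in sqlite3 module rather than PostgreSQL, so it only checks the set logic and says nothing about performance. The sample rows and the eligible_units helper are made up for illustration.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE work_units (id INTEGER PRIMARY KEY);
    CREATE TABLE work_units_skills (work_unit_id INTEGER, skill_id INTEGER);
    CREATE TABLE workers_skills (worker_id INTEGER, skill_id INTEGER);
    INSERT INTO work_units VALUES (1), (2), (3), (4);
    -- unit 1 needs {1, 2}, unit 2 needs {2, 3}, unit 3 needs {1}, unit 4 needs nothing
    INSERT INTO work_units_skills VALUES (1, 1), (1, 2), (2, 2), (2, 3), (3, 1);
    -- worker 1 has skills {1, 2}, worker 2 has skill {3}
    INSERT INTO workers_skills VALUES (1, 1), (1, 2), (2, 3);
""")

NOT_EXISTS_QUERY = """
    SELECT work_units.id
    FROM work_units
    WHERE NOT EXISTS (
        SELECT 1
        FROM work_units_skills wus
        WHERE wus.work_unit_id = work_units.id
          AND NOT EXISTS (
              SELECT 1
              FROM workers_skills ws
              WHERE ws.worker_id = ?
                AND ws.skill_id = wus.skill_id
          )
    )
    ORDER BY work_units.id
"""


def eligible_units(worker_id):
    """Return ids of work units whose required skills the worker fully covers."""
    return [row[0] for row in conn.execute(NOT_EXISTS_QUERY, (worker_id,))]


print(eligible_units(1))  # [1, 3, 4]
```

Unit 4 has no required skills, so it is returned for every worker, which matches the "empty set is a subset of every set" requirement from the question.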

Demo

http://rextester.com/AGEIS52439 - with the LIMIT removed for testing

(see UPDATE below)

This query finds a good work_unit using a simple LEFT JOIN to find a missing skill in the shorter table of skills the requesting worker has. The trick is that whenever there is a missing skill, there will be a NULL value in the join. This is translated to a 1, and work_units with a missing skill are removed by keeping only those with all 0 values, i.e. having a max of 0.

Being classic SQL, this would be the most heavily targeted query for optimization by the engine:

SELECT work_unit_id
FROM
  work_units_skills s
LEFT JOIN
  (SELECT skill_id FROM workers_skills WHERE worker_id = 1) t
ON (s.skill_id=t.skill_id)
GROUP BY work_unit_id
HAVING max(CASE WHEN t.skill_id IS NULL THEN 1 ELSE 0 END)=0
-- AND a bunch of other conditions
-- ORDER BY something complex
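LIMIT 1;
-- PostgreSQL does not allow FOR UPDATE in a query that uses GROUP BY, so take
-- the row lock (FOR UPDATE SKIP LOCKED) in an outer query on work_units instead.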

UPDATE

In order to catch work_units with no skills, we throw the work_units table into the JOIN:

SELECT r.id AS work_unit_id
FROM
  work_units r
LEFT JOIN
  work_units_skills s ON (r.id=s.work_unit_id)
LEFT JOIN
  (SELECT skill_id FROM workers_skills WHERE worker_id = 1) t
ON (s.skill_id=t.skill_id)
GROUP BY r.id
HAVING bool_or(s.skill_id IS NULL) OR bool_and(t.skill_id IS NOT NULL)
-- AND a bunch of other conditions
-- ORDER BY something complex
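LIMIT 1;
-- PostgreSQL does not allow FOR UPDATE in a query that uses GROUP BY, so take
-- the row lock (FOR UPDATE SKIP LOCKED) in an outer query on work_units instead.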

You may use the following query:

SELECT wu.*
FROM work_units wu
LEFT JOIN work_units_skills wus ON wus.work_unit_id = wu.id and wus.skill_id IN (
    SELECT id
    FROM skills
    EXCEPT
    SELECT skill_id
    FROM workers_skills
    WHERE worker_id = 1 -- the worker id that made the request
)
WHERE wus.work_unit_id IS NULL;  

demo (thanks, Steve Chambers, for most of the data)

You should definitely have indexes on work_units_skills(skill_id), workers_skills(worker_id) and work_units(id). If you want to speed it up even more, create indexes on work_units_skills(skill_id, work_unit_id) and workers_skills(worker_id, skill_id), which avoid accessing those tables.

The subquery is independent, and the outer join should be relatively fast if the result is not large.

Bit-Mask Solution
Without any changes to your existing database design, just add 2 fields.
First: a long or bigint (depending on your DBMS) in Workers
Second: another long or bigint in Work_Units

These fields represent the skills of work_units and the skills of workers. For example, suppose that you have 8 records in the Skills table (note that the number of skill records is small).
1- some skill 1
2- some skill 2
...
8- some skill 8

Then, if we want to assign skills 1, 3, 6 and 7 to one work_unit, just use the number 01100101.
(I suggest using the reversed binary placement, with skill 1 in the rightmost bit, to support additional skills in the future.)

In practice you can store the base-10 number in the database (101 instead of 01100101).

A similar number can be generated for workers. Each worker chooses some skills, so we can turn the selected items into a number and save it in the additional field in the Worker table.

Finally, to find the appropriate subset of work_units for any worker, JUST select from work_units and use bitwise AND as below.
A: new_field_of_specific_worker (the skills of the worker for whom we are searching work_units right now)
B: new_field_of_work_units (the skills of each work_unit)

select * from work_units
where A & B  = B
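The encoding and the A & B = B test can be sketched in a few lines of Python. The function names below are hypothetical, and the sketch assumes skill ids start at 1 and map to bit positions starting from the lowest bit, as in the 01100101 example above. Python integers are unbounded, whereas a bigint column holds at most 64 flags.

```python
def skills_to_mask(skill_ids):
    """Pack skill ids into an integer: skill n sets bit n - 1 (skill 1 is the lowest bit)."""
    mask = 0
    for skill_id in skill_ids:
        mask |= 1 << (skill_id - 1)
    return mask


def worker_covers(worker_mask, unit_mask):
    """True when every bit set in unit_mask is also set in worker_mask (A & B = B)."""
    return (worker_mask & unit_mask) == unit_mask


unit_mask = skills_to_mask([1, 3, 6, 7])
print(unit_mask, format(unit_mask, "08b"))  # 101 01100101

print(worker_covers(skills_to_mask([1, 3, 5, 6, 7]), unit_mask))  # True
print(worker_covers(skills_to_mask([1, 3, 6]), unit_mask))        # False, skill 7 is missing
```

A work unit with no skills has mask 0, and A & 0 = 0 holds for every worker, so the empty-set case needs no special handling here.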

Notice:
1: Absolutely, this is the fastest way, but it has some difficulties.
2: We have some extra difficulties when a skill is added or deleted. But this is a trade-off; adding or deleting skills happens rarely.
3: We should still keep skills, work_units_skills and workers_skills. But in the search, we just use the new fields.


Also, this approach can be used for tag management systems like Stack Overflow tags.

With the current info I can only answer on a hunch. Try removing the EXCEPT statement and see if it gets significantly faster. If it does, you can add that part again, but using WHERE conditions. In my experience, set operators (MINUS/EXCEPT, UNION, INTERSECT) are quite the performance killers.

The correlated sub-query is punishing you, especially with the additional use of EXCEPT.

To paraphrase your query, you're only interested in a work_unit_id when a specified worker has ALL of that work_unit's skills? (If a work_unit has a skill associated with it, but the specified user doesn't have that skill, exclude that work_unit?)

This can be achieved with JOINs and GROUP BY, with no need for correlation at all.

SELECT
    work_units.*
FROM
    work_units
--
-- some joins
--
INNER JOIN
(
    SELECT
        wus.work_unit_id
    FROM
        work_units_skills  wus
    LEFT JOIN
        workers_skills     ws
            ON  ws.skill_id  = wus.skill_id
            AND ws.worker_id = 1
    GROUP BY
        wus.work_unit_id
    HAVING
        COUNT(wus.skill_id) = COUNT(ws.skill_id)
)
     applicable_work_units
         ON  applicable_work_units.work_unit_id = work_units.id
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1

The sub-query compares a worker's skill set to each work unit's skill set. If there are any skills the work unit has that the worker doesn't, then ws.skill_id will be NULL for that row. As NULL is ignored by COUNT(), COUNT(ws.skill_id) will then be lower than COUNT(wus.skill_id), and so that work_unit is excluded from the sub-query's results.

This assumes that the workers_skills table is unique over (worker_id, skill_id) and that the work_units_skills table is unique over (work_unit_id, skill_id). If that's not the case then you may want to tinker with the HAVING clause (such as COUNT(DISTINCT wus.skill_id), etc).


EDIT:

The above query assumes that only a relatively low number of work units would match the criteria for a specific worker.

If you assume that a relatively large number of work units would match, the opposite logic would be faster.

(Essentially, try to make the number of rows returned by the sub-query as low as possible.)

SELECT
    work_units.*
FROM
    work_units
--
-- some joins
--
LEFT JOIN
(
    SELECT
        wus.work_unit_id
    FROM
        work_units_skills  wus
    LEFT JOIN
        workers_skills     ws
            ON  ws.skill_id  = wus.skill_id
            AND ws.worker_id = 1
    WHERE
        ws.skill_id IS NULL
    GROUP BY
        wus.work_unit_id
)
     excluded_work_units
         ON  excluded_work_units.work_unit_id = work_units.id
WHERE
    excluded_work_units.work_unit_id IS NULL
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
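To see what the exclusion sub-query produces, here is a small sketch that runs the same anti-join shape on Python's built-in sqlite3 module. SQLite is only a stand-in for PostgreSQL, the sample rows are invented, and the ids helper is hypothetical. It illustrates the logic, not the performance.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE work_units (id INTEGER PRIMARY KEY);
    CREATE TABLE work_units_skills (work_unit_id INTEGER, skill_id INTEGER);
    CREATE TABLE workers_skills (worker_id INTEGER, skill_id INTEGER);
    INSERT INTO work_units VALUES (1), (2), (3), (4);
    -- unit 1 needs {1, 2}, unit 2 needs {2, 3}, unit 3 needs {1}, unit 4 needs nothing
    INSERT INTO work_units_skills VALUES (1, 1), (1, 2), (2, 2), (2, 3), (3, 1);
    -- worker 1 has skills {1, 2}, worker 2 has skill {3}
    INSERT INTO workers_skills VALUES (1, 1), (1, 2), (2, 3);
""")

# Work units that require at least one skill the worker does not have.
EXCLUDED_SUBQUERY = """
    SELECT wus.work_unit_id
    FROM work_units_skills wus
    LEFT JOIN workers_skills ws
        ON ws.skill_id = wus.skill_id AND ws.worker_id = :worker_id
    WHERE ws.skill_id IS NULL
    GROUP BY wus.work_unit_id
"""

# Anti-join: keep only the work units that are absent from the excluded list.
ANTI_JOIN_QUERY = f"""
    SELECT work_units.id
    FROM work_units
    LEFT JOIN ({EXCLUDED_SUBQUERY}) excluded_work_units
        ON excluded_work_units.work_unit_id = work_units.id
    WHERE excluded_work_units.work_unit_id IS NULL
    ORDER BY work_units.id
"""


def ids(query, worker_id):
    """Run a query for one worker and return the first column as a list."""
    return [row[0] for row in conn.execute(query, {"worker_id": worker_id})]


print(sorted(ids(EXCLUDED_SUBQUERY, 1)))  # [2]
print(ids(ANTI_JOIN_QUERY, 1))            # [1, 3, 4]
```

With this data, worker 1 only excludes unit 2, so the sub-query stays small. For worker 2 the excluded list grows to units 1, 2 and 3, which is the case where the first form of the query would return fewer rows from its sub-query.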

This one compares all work unit skills with those of the worker, and only keeps rows where the work unit has skills that the worker does not have.

Then, GROUP BY the work unit to get a list of work units that need to be ignored.

By LEFT joining these on to your existing results, you can stipulate you only want to include a work unit if it doesn't appear in the sub-query by specifying excluded_work_units.work_unit_id IS NULL.

Useful online guides will refer to anti-join and anti-semi-join.


EDIT:

In general I would recommend against the use of a bit-mask.

Not because it's slow, but because it defies normalisation. The existence of a single field representing multiple items of data is a general sql-code-smell / sql-anti-pattern, as the data is no longer atomic. (This leads to pain down the road, especially if you reach a world where you have so many skills that they no longer all fit in to the data type chosen for the bit-mask, or when it comes to managing frequent or complex changes to the skill sets.)

That said, if performance continues to be an issue, de-normalisation is often a very useful option. I'd recommend keeping the bit masks in separate tables to make it clear that they're de-normalised / cached calculation results. In general though, such options should be a last resort rather than a first reaction.


EDIT: Example revisions to always include work_units that have no skills...

SELECT
    work_units.*
FROM
    work_units
--
-- some joins
--
INNER JOIN
(
    SELECT
        w.id   AS work_unit_id
    FROM
        work_units          w
    LEFT JOIN
        work_units_skills   wus
            ON wus.work_unit_id = w.id
    LEFT JOIN
        workers_skills      ws
            ON  ws.skill_id  = wus.skill_id
            AND ws.worker_id = 1
    GROUP BY
        w.id
    HAVING
        COUNT(wus.skill_id) = COUNT(ws.skill_id)
)
     applicable_work_units
         ON  applicable_work_units.work_unit_id = work_units.id
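The difference between the two COUNT-based sub-queries only shows up for work units without any skills. The sketch below runs both on Python's built-in sqlite3 module, with SQLite standing in for PostgreSQL and invented sample rows. The constant names and the ids helper are hypothetical.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE work_units (id INTEGER PRIMARY KEY);
    CREATE TABLE work_units_skills (work_unit_id INTEGER, skill_id INTEGER);
    CREATE TABLE workers_skills (worker_id INTEGER, skill_id INTEGER);
    INSERT INTO work_units VALUES (1), (2), (3), (4);
    -- unit 1 needs {1, 2}, unit 2 needs {2, 3}, unit 3 needs {1}, unit 4 needs nothing
    INSERT INTO work_units_skills VALUES (1, 1), (1, 2), (2, 2), (2, 3), (3, 1);
    -- worker 1 has skills {1, 2}
    INSERT INTO workers_skills VALUES (1, 1), (1, 2);
""")

# Driven from the join table: units without skills never appear at all.
COUNT_FROM_JOIN_TABLE = """
    SELECT wus.work_unit_id
    FROM work_units_skills wus
    LEFT JOIN workers_skills ws
        ON ws.skill_id = wus.skill_id AND ws.worker_id = :worker_id
    GROUP BY wus.work_unit_id
    HAVING COUNT(wus.skill_id) = COUNT(ws.skill_id)
    ORDER BY wus.work_unit_id
"""

# Driven from work_units: a unit without skills yields 0 = 0 and is kept.
COUNT_FROM_WORK_UNITS = """
    SELECT w.id AS work_unit_id
    FROM work_units w
    LEFT JOIN work_units_skills wus
        ON wus.work_unit_id = w.id
    LEFT JOIN workers_skills ws
        ON ws.skill_id = wus.skill_id AND ws.worker_id = :worker_id
    GROUP BY w.id
    HAVING COUNT(wus.skill_id) = COUNT(ws.skill_id)
    ORDER BY w.id
"""


def ids(query, worker_id):
    """Run a query for one worker and return the first column as a list."""
    return [row[0] for row in conn.execute(query, {"worker_id": worker_id})]


print(ids(COUNT_FROM_JOIN_TABLE, 1))   # [1, 3]
print(ids(COUNT_FROM_WORK_UNITS, 1))   # [1, 3, 4]
```

Both queries rely on the uniqueness assumption stated earlier: one row per (worker_id, skill_id) and per (work_unit_id, skill_id).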

The excluded_work_units version of the code (the second example query above) should work without modification for this corner case (and is the one I'd initially trial for live performance metrics).

You can get the work units covered by a worker's skills in an aggregation, as has been shown already. You'd typically use IN on this set of work units then.

SELECT wu.*
FROM work_units wu
-- some joins
WHERE wu.id IN
(
  SELECT wus.work_unit_id
  FROM work_units_skills wus
  LEFT JOIN workers_skills ws ON ws.skill_id = wus.skill_id AND ws.worker_id = 1
  GROUP BY wus.work_unit_id
  HAVING COUNT(*) = COUNT(ws.skill_id)
)
-- AND a bunch of other conditions
-- ORDER BY something complex
LIMIT 1
FOR UPDATE SKIP LOCKED;

When it comes to speeding up queries, the main part is often to provide the appropriate indexes, though. (With a perfect optimizer, re-writing a query to get the same result would have no effect at all, because the optimizer would get to the same execution plan.)

You want the following indexes (order of the columns matters):

create index idx_ws on workers_skills (worker_id, skill_id);
create index idx_wus on work_units_skills (skill_id, work_unit_id);

(Read it like this: We come with a worker_id, get the skill_ids for the worker, join work units on these skill_ids and thus get the work_unit_ids.)

Might not apply to you, but I had a similar issue that I solved by simply merging the main and sub into the same column, using numbers for main and letters for sub.

Btw, are all columns involved in the joins indexed? My server goes from a 2-3 sec query on 500k+ tables to crashing on 10k tables if I forget.

With Postgres, relational division can often be expressed more efficiently using arrays.

In your case I think the following will do what you want:

select *
from work_units
where id in (select work_unit_id
             from work_units_skills
             group by work_unit_id
             having array_agg(skill_id) <@ array(select skill_id 
                                                 from workers_skills 
                                                 where worker_id = 6))
and ... other conditions here ...
order by ...

array_agg(skill_id) collects all skill_ids for each work_unit and compares that with the skills of a specific worker using the <@ operator ("is contained by"). That condition returns all work_unit_ids where the list of skill_ids is contained in the skills of a single worker.

In my experience this approach is usually faster than equivalent EXISTS or INTERSECT solutions.

Online example: http://rextester.com/WUPA82849
