There is a set of users. A person can have multiple users, but ref1 and ref2 might be alike and can therefore link users together. ref1 and ref2 do not overlap: a value in ref1 never exists in ref2.
A user can own multiple assets. I want to "merge" users that have one or more refs alike and then count how many assets they own together. There could be missing entries in the users table; in that case I just want to propagate the owner into ref2 and set the asset_count and asset_ids.
Here is an example schema to illustrate:
Example assets
SELECT * FROM assets;
id | name | owner |
---|---|---|
1 | #1 | a |
2 | #2 | b |
3 | #3 | c |
4 | #4 | a |
5 | #5 | c |
6 | #6 | d |
7 | #7 | e |
8 | #8 | d |
9 | #9 | a |
10 | #10 | a |
11 | #11 | z |
Example users
SELECT * FROM users;
id | username | ref1 | ref2 |
---|---|---|---|
1 | bobo | a | d |
2 | toto | b | e |
3 | momo | c | d |
4 | lolo | a | f |
5 | popo | c | f |
What I want to get in the end
SELECT * FROM results;
ids | usernames | refs1 | refs2 | asset_ids | asset_count |
---|---|---|---|---|---|
1,3,4,5 | bobo,momo,lolo,popo | a,c | d,f | 1,3,4,5,6,8,9,10 | 8 |
2 | toto | b | e | 2,7 | 2 |
 | | | z | 11 | 1 |
I've tried different approaches, but this is what I currently have:
Closest I have got
SELECT
ARRAY_AGG(DISTINCT u.id) AS ids,
ARRAY_AGG(DISTINCT u.username) AS usernames,
ARRAY_AGG(DISTINCT u.ref1) AS refs1,
ARRAY_AGG(DISTINCT u.ref2) AS refs2,
COUNT(DISTINCT a.id) AS asset_count
FROM assets a
JOIN users u ON a.owner = u.ref1 OR a.owner = u.ref2
GROUP BY a.owner
ORDER BY MIN(a.id);
ids | usernames | refs1 | refs2 | asset_count |
---|---|---|---|---|
1,4 | bobo,lolo | a | d,f | 4 |
2 | toto | b | e | 1 |
3,5 | momo,popo | c | d,f | 2 |
1,3 | bobo,momo | a,c | d | 2 |
2 | toto | b | e | 1 |
If I merge the above table on ids, I almost get the result I want (without the missing entries in the users table). The merging can easily be done in code, but then I cannot paginate etc. I want to do this in the DB layer if possible.
I want either a solution to the problem or a good explanation of why it is not possible to do (with examples).
Please check out my DB Fiddle.
There are two distinct parts to the question:
Identifying clusters of users based on common references reads like a graph-walking problem. That's a complex task in SQL, and requires a recursive query. The pattern is to unpivot the users' references to generate nodes, then identify edges (nodes that have a ref in common), and finally walk through the graph (without looping) to generate groups.
In Postgres, arrays come in handy to aggregate nodes:
with recursive
nodes as (
select u.id, r.ref
from users u
cross join lateral ( values (u.ref1), (u.ref2) ) r(ref)
),
edges as (
select distinct n1.id as id1, n2.id as id2
from nodes n1
inner join nodes n2 on n1.ref = n2.ref
),
rcte as (
select id1, id2, array[id1] as visited from edges where id1 = id2
union all
select r.id1, e.id2, r.visited || e.id2
from rcte r
inner join edges e on e.id1 = r.id2
where e.id2 <> all(r.visited)
),
groups as (
select id1 as id, array_agg(distinct id2 order by id2) as ids
from rcte
group by id1
)
select * from groups order by id
id | ids |
---|---|
1 | {1,3,4,5} |
2 | {2} |
3 | {1,3,4,5} |
4 | {1,3,4,5} |
5 | {1,3,4,5} |
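For intuition, the same clustering can be sketched outside the database with a union-find over the shared refs. This is a hypothetical Python sketch (not part of any answer's SQL), using the example data from the question:

```python
# Hypothetical sketch: cluster users that share a ref, using union-find.
# Data mirrors the users table in the question: id -> (ref1, ref2).
users = {1: ("a", "d"), 2: ("b", "e"), 3: ("c", "d"), 4: ("a", "f"), 5: ("c", "f")}

parent = {}

def find(x):
    # Find the root of x with path compression.
    parent.setdefault(x, x)
    if parent[x] != x:
        parent[x] = find(parent[x])
    return parent[x]

def union(a, b):
    parent[find(a)] = find(b)

# Link each user id to both of its refs; shared refs connect users transitively.
for uid, (r1, r2) in users.items():
    union(("u", uid), ("r", r1))
    union(("u", uid), ("r", r2))

groups = {}
for uid in users:
    groups.setdefault(find(("u", uid)), []).append(uid)

clusters = sorted(sorted(g) for g in groups.values())
print(clusters)  # [[1, 3, 4, 5], [2]]
```

This matches the groups produced by the recursive CTE above: users 1, 3, 4, 5 are chained together through refs a, c, d, f, while user 2 stands alone.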
left join and aggregation
Now that we have identified the groups, we can check for assets. Since you want all assets in the result, we start from the assets table, then bring in the users and the groups with left joins. We can still group by the user groups, which puts all orphan assets in the same group (where the group is null) - that's exactly what we want.
The last step is array aggregation; the "propagation" of the owners of orphan assets to ref2 can be handled with a case expression.
with recursive
nodes as (...),
edges as (...),
rcte as (...),
groups as (...)
select g.ids,
array_agg(distinct u.username) as usernames,
array_agg(distinct u.ref1) as refs1,
case when g.ids is null then array_agg(distinct a.owner) else array_agg(distinct u.ref2) end as refs2,
array_agg(distinct a.id) as asset_ids,
count(distinct a.id) as asset_count
from assets a
left join users u on a.owner in (u.ref1, u.ref2)
left join groups g on g.id = u.id
group by g.ids
ids | usernames | refs1 | refs2 | asset_ids | asset_count |
---|---|---|---|---|---|
{1,3,4,5} | {bobo,lolo,momo,popo} | {a,c} | {d,f} | {1,3,4,5,6,8,9,10} | 8 |
{2} | {toto} | {b} | {e} | {2,7} | 2 |
null | {NULL} | {NULL} | {z} | {11} | 1 |
Performance will suffer on networks that have a lot of edges, especially if there are just a few clusters in the graph, each containing many users. We can try and optimize the query for such situations; the idea is to limit the number of paths that need to be walked by aggregating all paths of each user at each iteration.
This query passes your test case with 200 users that all belong to the same cluster (whereas the first query exhausts the DB Fiddle resources):
with recursive
nodes as (
select u.id, r.ref
from users u
cross join lateral ( values (u.ref1), (u.ref2) ) r(ref)
),
edges as (
select distinct n1.id as id1, n2.id as id2
from nodes n1
inner join nodes n2 on n1.ref = n2.ref
),
rcte as (
select id1 as id, array[id1] as visited from edges where id1 = id2
union all
select id, array_agg(distinct visited) as visited
from (
select r.id, v.visited
from rcte r
inner join edges e on e.id1 = any(r.visited)
cross join lateral unnest(r.visited || e.id2) v(visited)
where e.id2 <> all(r.visited)
) as x
group by id
),
groups as (
select distinct on (id) id, visited as ids
from rcte
order by id, array_length(visited, 1) desc
)
select * from groups g order by id
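The idea behind this optimization can be sketched outside the database. In this hypothetical Python sketch, each user's visited set absorbs its entire frontier on every iteration (mirroring the array_agg in the optimized rcte), so the work per iteration stays bounded by the number of users rather than the number of distinct paths:

```python
# Hypothetical sketch of the aggregated-paths walk.
# Adjacency derived from the example: users sharing a ref are neighbours
# (ref a links 1-4, ref d links 1-3, ref c links 3-5, ref f links 4-5).
edges = {1: {1, 3, 4}, 2: {2}, 3: {1, 3, 5}, 4: {1, 4, 5}, 5: {3, 4, 5}}

visited = {uid: {uid} for uid in edges}
changed = True
while changed:
    changed = False
    for uid in visited:
        # Union the neighbour sets of everything reached so far: one "iteration"
        # swallows the whole frontier instead of extending one path at a time.
        frontier = set().union(*(edges[v] for v in visited[uid]))
        if not frontier <= visited[uid]:
            visited[uid] |= frontier
            changed = True

print({uid: sorted(v) for uid, v in visited.items()})
# {1: [1, 3, 4, 5], 2: [2], 3: [1, 3, 4, 5], 4: [1, 3, 4, 5], 5: [1, 3, 4, 5]}
```

On a long chain of N users this converges in O(log N) rounds instead of the O(N) steps a one-edge-at-a-time walk needs, which is why the second query survives the 200-user single-cluster test.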
This is hard (and typically very slow) to solve with only set-based operations in SQL. So I loop in a PL/pgSQL function:
CREATE OR REPLACE FUNCTION f_merged_users()
RETURNS TABLE (
ids int[] -- adapt to your actual data types !!!
, usernames text[]
, refs1 text[]
, refs2 text[]
, asset_ids int[]
, asset_count int
)
LANGUAGE plpgsql AS
$func$
BEGIN
/*
-- set enough temp buffers, so temp tables don't spill to disk (optional)
-- NOTE: this can't be used after temporary tables have been accessed in the session
-- using 2x size of users table. adapt to your needs!
-- only fires when the new size is bigger, and it has not been set manually, yet
PERFORM set_config('temp_buffers', (pg_table_size('users') * 2 / 8192)::text, true)
FROM pg_settings s
WHERE s.name = 'temp_buffers'
AND pg_table_size('users') * 2 / 8192 > s.setting::bigint
AND s.source = 'default';
*/
-- create two temp tables: one for ref1, one for ref2
-- based on your information that ref1 & ref2 don't overlap
CREATE TEMP TABLE r1 ON COMMIT DROP AS
SELECT ARRAY[ref1] AS r, array_agg(id) AS i
FROM (TABLE users ORDER BY id) t
GROUP BY ref1
ORDER BY ref1;
CREATE TEMP TABLE r2 ON COMMIT DROP AS
SELECT ARRAY[ref2] AS r, array_agg(id) AS i
FROM (TABLE users ORDER BY id) t
GROUP BY ref2
ORDER BY ref2;
-- fold rows where a common link in the other table exists
-- achieve that by deleting rows and inserting the merged results
-- loop back and forth until no more rows can be folded
LOOP
WITH d AS (
DELETE FROM r2
USING r1
WHERE EXISTS (
SELECT FROM r2 r0
WHERE r1.i && r2.i
AND r1.i && r0.i
AND r2.ctid <> r0.ctid
)
RETURNING r1.ctid, r2.*
)
INSERT INTO r2
SELECT r.r, i.i
FROM (
SELECT ctid, array_agg(ref ORDER BY ref) AS r
FROM d, unnest(d.r) ref
GROUP BY 1
) r
JOIN (
SELECT ctid, array_agg(id ORDER BY id) AS i
FROM d, unnest(d.i) id
GROUP BY 1
) i USING (ctid);
EXIT WHEN NOT FOUND; -- no folding happened, stop loop
WITH d AS (
DELETE FROM r1
USING r2
WHERE EXISTS (
SELECT FROM r1 r0
WHERE r2.i && r1.i
AND r2.i && r0.i
AND r1.ctid <> r0.ctid
)
RETURNING r2.ctid, r1.*
)
INSERT INTO r1
SELECT r.r, i.i
FROM (
SELECT ctid, array_agg(ref ORDER BY ref) AS r
FROM d, unnest(d.r) ref
GROUP BY 1
) r
JOIN (
SELECT ctid, array_agg(id ORDER BY id) AS i
FROM d, unnest(d.i) id
GROUP BY 1
) i USING (ctid);
EXIT WHEN NOT FOUND; -- no folding happened, stop loop
END LOOP;
-- output result
RETURN QUERY
(
SELECT i -- AS ids
, u.usernames
, r1.r -- AS refs1
, r2.r -- AS refs2
, a.asset_ids
, cardinality(a.asset_ids) -- AS asset_count
FROM r1
JOIN r2 USING (i)
CROSS JOIN LATERAL (
SELECT ARRAY (
SELECT username
FROM users u
WHERE u.id = ANY (i)
ORDER BY u.id
) AS usernames
) u
CROSS JOIN LATERAL (
SELECT ARRAY (
SELECT a.id
FROM assets a
WHERE a.owner = ANY (r1.r || r2.r)
ORDER BY a.id
) AS asset_ids
) a
ORDER BY i
)
UNION ALL -- add "missing entries in the user table"
(
SELECT null -- AS ids
, null -- AS usernames
, null -- AS refs1
, ARRAY[a.owner] -- AS refs2
, ARRAY[a.id] -- AS ids
, 1 -- AS asset_count
FROM assets a
WHERE NOT EXISTS (
SELECT FROM users u
WHERE a.owner IN (u.ref1, u.ref2)
)
ORDER BY a.id
);
END
$func$;
Call:
SELECT * FROM f_merged_users();
See inline comments with explanation.
The general idea is to avoid walking potentially lengthy graphs one step at time, and do as much set-based work as possible. The best solution heavily depends on actual data distribution.
Based on your information that ref1 and ref2 don't overlap, I create two temp tables: one for ref1, one for ref2. The initial queries aggregate all users on the same resource. That takes care of many users on the same resource (same person) right away. You commented that you have a lot of those.
In every iteration of the loop, I merge all rows in the other temp table that are linked together by the same row in this temp table, and vice versa. The two queries are completely symmetrical. (It may be more efficient to start with one or the other.) We are done as soon as one query does not find any more rows to merge.
Then return the final set with all details, and append "missing entries in the user table" like you added in the question.
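The fold loop can be sketched outside the database. This hypothetical Python sketch (names and data structures are mine, not the function's) models the two temp tables as lists of (refs, ids) rows and merges rows on one side whenever a row on the other side links them, until a full pass changes nothing:

```python
# Hypothetical sketch of the r1/r2 fold loop from the PL/pgSQL function.
# Each side is a list of (set_of_refs, set_of_user_ids) rows.
def fold(target, other):
    """Merge rows of `target` linked by a common row of `other`.
    Returns True if anything was merged (mirrors EXIT WHEN NOT FOUND)."""
    merged = False
    for _refs, o_ids in other:
        linked = [row for row in target if row[1] & o_ids]  # overlapping id arrays
        if len(linked) > 1:
            refs = set().union(*(r for r, _ in linked))
            ids = set().union(*(i for _, i in linked))
            for row in linked:          # delete the folded rows ...
                target.remove(row)
            target.append((refs, ids))  # ... and insert the merged result
            merged = True
    return merged

# Initial aggregation per ref, as in the CREATE TEMP TABLE queries.
r1 = [({"a"}, {1, 4}), ({"b"}, {2}), ({"c"}, {3, 5})]
r2 = [({"d"}, {1, 3}), ({"e"}, {2}), ({"f"}, {4, 5})]

# Loop back and forth until no more rows can be folded.
while True:
    if not fold(r2, r1):
        break
    if not fold(r1, r2):
        break

print(sorted((sorted(r), sorted(i)) for r, i in r1))
# [(['a', 'c'], [1, 3, 4, 5]), (['b'], [2])]
```

On the example data a single back-and-forth pass suffices: r1 row a={1,4} links r2 rows d and f, which fold into ({d,f},{1,3,4,5}); that merged row then links r1 rows a and c, folding them into ({a,c},{1,3,4,5}).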
I have streamlined the workflow, but DELETE and INSERT are comparatively expensive operations. Make sure you allow enough temporary buffers so that the two temp tables can operate in RAM exclusively. I added a bit of clever automatic setting at the top. But you may not need that if your temp_buffers setting is high enough anyway. And the automatic setting can't be used after you have accessed any temp tables in the same session.
The two temp tables are dropped at the end of the transaction automatically, since they are created with ON COMMIT DROP.
Performance might be optimized some more with indexes on the temp tables and some more fine-tuning. In this case, it may also help to run ANALYZE (one or more times) on the temp tables.
The queries identify physical rows with the system column ctid, since the temp tables have no ID column of their own.
What you're showing is already a valid graph, so you can query it as such using pgRouting. Here's how you present this as a graph to the extension:
create table users1 as
select id,
ref1::int as source,
ref2::int as target,
1::numeric as cost,
1::numeric as reverse_cost
from users
union all
select id, --the same edges, pointed the other way
ref2::int as source,
ref1::int as target,
1::numeric as cost,
1::numeric as reverse_cost
from users;
If ref1 and ref2 aren't int, you might need to map them. Then, here's how you ask it about the groups:
select component as "the_group",
array_agg(node) as "ids_in_this_group"
from pgr_connectedComponents(
'SELECT id,
ref1 as source,
ref2 as target,
cost,
reverse_cost
FROM users1')
group by component;
That is, until SQL:2023's SQL/PGQ makes its way into a PostgreSQL release.
It takes seconds on a sample of 1 million users, each one connected to the next. I've also tested this on a few mixes of large disconnected groups, volumes of 1, 2, 3, ..., N-element groups, and large random configurations, but one large chain seems to be the worst-case scenario for all current answers here, while everything else only seems to speed things up.
db<>fiddle, db-fiddle and sqlfiddle don't offer pgrouting, which is why I couldn't attach an online demo.
Please consider the following solution:
WITH agg1 as (
SELECT
string_agg(distinct users.id::text, ',') id,
string_agg(distinct username, ',') username,
string_agg(distinct ref2, ',') ref2,
string_agg(distinct assets.id::text, ',') assets_id,
sum(assets_count) assets_count,
ref1
FROM users
JOIN (
SELECT string_agg(id::text, ',') id, count(*) assets_count, owner
FROM assets GROUP BY owner
) assets ON assets.owner in (ref1, ref2)
GROUP BY ref1
) SELECT
string_to_array(string_agg(id::text, ','),',') ids,
string_to_array(string_agg(username, ','),',') usernames,
string_to_array(string_agg(ref1, ','),',') refs1,
string_to_array(ref2,',') refs2,
string_to_array(string_agg(assets_id, ','),',') assets_id,
min(assets_count) assets_count
FROM agg1
GROUP BY agg1.ref2
;
https://sqlize.online/sql/psql11/88ab227ab4a34c532fc711cff533f272/
It returns almost the desired result; you only need to de-duplicate the assets_id array.