SQL self join and aggregation

I have a table with the following structure in PostgreSQL:

Table path: passengers, origin, dest, date, month, year

I want to find the top 3 routes by the number of passengers who travelled on each route in a year. The total number of passengers on a route (A <-> B) = total number of passengers (A -> B) + total number of passengers (B -> A).

What is the best way to aggregate the number of passengers per route? The table has approximately 150 million rows.

Thanks

There are two approaches to this. One is aggregation and the other is a join.

select least(origin, dest) as od1, greatest(origin, dest) as od2, sum(passengers) as numpassengers
from path t
group by least(origin, dest), greatest(origin, dest)
order by numpassengers desc
limit 3;
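The question asks for the top routes within a year, which the query above does not filter on. A minimal sketch of how that could look, written against a throwaway table so it can be run as-is in PostgreSQL; the table name path_demo, the sample rows and the year 2015 are made up for illustration:

```sql
-- hypothetical demo table with only the columns the query needs
create temp table path_demo (passengers int, origin text, dest text, year int);

insert into path_demo (passengers, origin, dest, year) values
    (100, 'A', 'B', 2015),
    ( 50, 'B', 'A', 2015),
    ( 70, 'A', 'C', 2015),
    ( 30, 'C', 'A', 2015),
    ( 20, 'B', 'C', 2015),
    ( 10, 'C', 'B', 2015),
    (  5, 'A', 'D', 2015),
    (999, 'A', 'B', 2014);  -- other year, must not be counted

-- least/greatest put both directions of a route into the same group
select least(origin, dest) as od1, greatest(origin, dest) as od2,
       sum(passengers) as numpassengers
from path_demo
where year = 2015
group by least(origin, dest), greatest(origin, dest)
order by numpassengers desc
limit 3;
-- expected rows: (A, B, 150), (A, C, 100), (B, C, 30)
```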

The other is a self-join. If there is only one row in each direction, you can do this without aggregation:

select p1.origin, p1.dest, p1.passengers + p2.passengers as numpassengers
from path p1 join
     path p2
     on p1.origin = p2.dest and p1.dest = p2.origin
where p1.origin < p1.dest
order by numpassengers desc
limit 3;

Otherwise, you need a self join and aggregation, so the first method is probably faster:

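-- aggregate each direction first; joining the raw rows would pair every
-- A->B row with every B->A row and inflate the sum
select p1.origin, p1.dest, p1.passengers + p2.passengers as numpassengers
from (select origin, dest, sum(passengers) as passengers
      from path group by origin, dest) p1 join
     (select origin, dest, sum(passengers) as passengers
      from path group by origin, dest) p2
     on p1.origin = p2.dest and p1.dest = p2.origin
where p1.origin < p1.dest
order by numpassengers desc
limit 3;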

I do not know which would be more efficient. However, I suspect the top 3 routes by the sum would be in, say, the top 100 for each direction. If so, build an index on passengers, and try:

select least(origin, dest) as od1, greatest(origin, dest) as od2, sum(passengers) as numpassengers
from path t cross join
     (select min(passengers) as cutoff
      from (select distinct passengers
            from path
            order by passengers desc
            limit 100
           ) t
     ) minp
where t.passengers >= minp.cutoff
group by least(origin, dest), greatest(origin, dest)
order by numpassengers desc
limit 3;

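The calculation of the cutoff should only need the index, which greatly reduces the load of the rest of the query. Keep in mind that the sums then include only rows at or above the cutoff, so the result is right only if the suspicion above holds.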
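The index mentioned above could be created as follows. This is only a sketch: the index name path_passengers_idx is an assumption, and whether the planner really answers the cutoff subquery from the index is worth confirming with explain on your own data.

```sql
-- hypothetical index name; a plain b-tree can also be scanned in descending order
create index path_passengers_idx on path (passengers);

-- inspect the plan of the cutoff calculation on its own
explain
select min(passengers) as cutoff
from (select distinct passengers
      from path
      order by passengers desc
      limit 100
     ) top100;
```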

EDIT:

If you don't have least() and greatest(), just use case expressions:

select (case when origin < dest then origin else dest end) as od1,
       (case when origin < dest then dest else origin end)  as od2,
       sum(passengers) as numpassengers
from path t
group by 1, 2
order by numpassengers desc
limit 3;

You can repeat the case expressions in the group by instead, but Amazon Redshift lets you refer to column aliases or positions in the group by clause.
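For a database that does not accept aliases or positions in group by, the variant with the expressions repeated might look like this (a sketch of the same query against the same path table):

```sql
select (case when origin < dest then origin else dest end) as od1,
       (case when origin < dest then dest else origin end) as od2,
       sum(passengers) as numpassengers
from path t
-- the grouping expressions are spelled out instead of "group by 1, 2"
group by (case when origin < dest then origin else dest end),
         (case when origin < dest then dest else origin end)
order by numpassengers desc
limit 3;
```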

If every route is used in both directions, this should give you an answer:

SELECT (x.passengers + y.passengers) as passengers_sum, x.origin, x.dest
FROM yourTable x
JOIN yourTable y
ON x.origin = y.dest AND x.dest = y.origin
ORDER BY passengers_sum DESC;

With indexes on your origin and dest columns, that self join should not worry you. I see no way to avoid an operation of that scale to get the requested result. You will have to add some kind of LIMIT to the statement if you only want the top X rows; I have no Postgres experience with that.
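In PostgreSQL the limit goes at the end of the statement. A sketch, assuming one row per direction as above; the index name is hypothetical, and the extra x.origin < x.dest condition is an addition that keeps each pair of endpoints from showing up twice:

```sql
-- hypothetical index name; a composite index matches the join condition
create index yourtable_origin_dest_idx on yourTable (origin, dest);

select x.passengers + y.passengers as passengers_sum, x.origin, x.dest
from yourTable x
join yourTable y
  on x.origin = y.dest and x.dest = y.origin
where x.origin < x.dest   -- keep one row per pair of endpoints
order by passengers_sum desc
limit 3;
```

Whether the planner actually uses the index for a join over the whole table depends on the data, so it is worth checking the plan with explain.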

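I think SebastianH has it right. As a minor improvement you could return only the top three rows. PostgreSQL does not support the SELECT TOP clause (that is SQL Server syntax), so use ORDER BY with LIMIT instead:

-- assumes one row per direction; with several rows per direction the join
-- multiplies rows and inflates the sum
SELECT T.ORIGIN, T.DEST, T.NUMPASSENGERS
    FROM (SELECT SUM(A.PASSENGERS + B.PASSENGERS) AS NUMPASSENGERS, A.ORIGIN, A.DEST
          FROM YOURTABLE A JOIN YOURTABLE B
            ON (A.ORIGIN = B.DEST AND A.DEST = B.ORIGIN)
          WHERE A.ORIGIN < A.DEST
          GROUP BY A.ORIGIN, A.DEST
         ) T
    ORDER BY T.NUMPASSENGERS DESC
    LIMIT 3;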
