
SQL self join and aggregation

I have a table with the following structure in postgres

Table path: passengers, origin, dest, date, month, year

I want to find the top 3 routes based on the number of passengers who travelled on a route in a year. Total number of passengers on a route (A <-> B) = total number of passengers (A -> B) + total number of passengers (B -> A)

What's the best / optimal way to aggregate the number of passengers on a route? The table has approximately 150 million rows.

Thanks

There are two approaches to this. One is aggregation and the other is a join.

select least(origin, dest) as od1, greatest(origin, dest) as od2, sum(passengers) as numpassengers
from path t
group by least(origin, dest), greatest(origin, dest)
order by numpassengers desc
limit 3;
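
To see how folding both directions onto one pair behaves, here is a small sanity check in Python, with the built-in sqlite3 module standing in for Postgres. The sample rows are invented, and because SQLite has no least() / greatest(), the query uses equivalent case expressions. Treat it as a sketch under those assumptions, not as a benchmark.

```python
import sqlite3

# In-memory SQLite stands in for Postgres; the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("create table path (passengers integer, origin text, dest text)")
conn.executemany(
    "insert into path (passengers, origin, dest) values (?, ?, ?)",
    [
        (100, "A", "B"),
        (50, "B", "A"),
        (70, "A", "C"),
        (20, "C", "A"),
        (10, "B", "C"),
        (30, "D", "A"),
    ],
)

# The case expressions play the role of least()/greatest(): both directions
# of a route collapse onto the same (od1, od2) pair before summing.
top_routes = conn.execute(
    """
    select case when origin < dest then origin else dest end as od1,
           case when origin < dest then dest else origin end as od2,
           sum(passengers) as numpassengers
    from path
    group by od1, od2
    order by numpassengers desc
    limit 3
    """
).fetchall()

print(top_routes)
```

A route that only has rows in one direction (D -> A here) still gets counted, which is one difference from the join-based queries.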

The other is a self-join. If there is only one row in each direction, you can do this without aggregation:

select p1.origin, p1.dest, p1.passengers + p2.passengers as numpassengers
from path p1 join
     path p2
     on p1.origin = p2.dest and p1.dest = p2.origin
where p1.origin < p1.dest
order by numpassengers desc
limit 3;

Otherwise, you need a self join and aggregation, so the first method is probably faster:

select p1.origin, p1.dest, sum(p1.passengers + p2.passengers) as numpassengers
from path p1 join
     path p2
     on p1.origin = p2.dest and p1.dest = p2.origin
where p1.origin < p1.dest
group by p1.origin, p1.dest
order by numpassengers desc
limit 3;
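
One caveat when a direction has several rows (for example one per month): the join pairs every row in one direction with every row in the other, so summing after the join counts passengers more than once. A possible workaround, sketched here with invented data and Python's sqlite3 standing in for Postgres, is to total each direction first and then join the totals. The CTE name `directed` is my own choice, not something from the question.

```python
import sqlite3

# In-memory SQLite stands in for Postgres; two invented rows per direction.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table path (passengers integer, origin text, dest text, month integer)"
)
conn.executemany(
    "insert into path (passengers, origin, dest, month) values (?, ?, ?, ?)",
    [
        (10, "A", "B", 1),
        (20, "A", "B", 2),
        (5, "B", "A", 1),
        (15, "B", "A", 2),
    ],
)

# Joining raw rows gives 2 x 2 = 4 joined rows, so each count is summed twice.
naive_total = conn.execute(
    """
    select sum(p1.passengers + p2.passengers)
    from path p1 join path p2
         on p1.origin = p2.dest and p1.dest = p2.origin
    where p1.origin < p1.dest
    """
).fetchone()[0]

# Totalling each direction first leaves one row per direction, so the join is 1:1.
fixed_rows = conn.execute(
    """
    with directed as (
        select origin, dest, sum(passengers) as passengers
        from path
        group by origin, dest
    )
    select d1.origin, d1.dest, d1.passengers + d2.passengers as numpassengers
    from directed d1 join directed d2
         on d1.origin = d2.dest and d1.dest = d2.origin
    where d1.origin < d1.dest
    order by numpassengers desc
    limit 3
    """
).fetchall()

print(naive_total)  # double-counted total from the raw join
print(fixed_rows)   # one row per route with the correct total
```

The true total for this sample is 50 passengers; the raw join reports 100. Note that the inner join still drops any route that has traffic in only one direction.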

I do not know which would be more efficient. However, I suspect the top 3 routes by the sum would be in, say, the top 100 for each direction. If so, build an index on passengers, and try:

select least(origin, dest) as od1, greatest(origin, dest) as od2, sum(passengers) as numpassengers
from path t cross join
     (select min(passengers) as cutoff
      from (select distinct passengers
            from path
            order by passengers desc
            limit 100
           ) t
     ) minp
where passengers >= minp.cutoff
group by least(origin, dest), greatest(origin, dest)
order by numpassengers desc
limit 3;

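The calculation of the cutoff should just use the index and greatly reduce the load of the rest of the query. Bear in mind that rows below the cutoff are left out of the sums, so this is a heuristic for finding the likely top routes rather than an exact total.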

EDIT:

If you don't have least() and greatest(), just use case statements:

select (case when origin < dest then origin else dest end) as od1,
       (case when origin < dest then dest else origin end)  as od2,
       sum(passengers) as numpassengers
from path t
group by 1, 2
order by numpassengers desc
limit 3;

You can repeat the case statements in the group by. But Amazon Redshift lets you refer to column aliases or positions in the group by clause.

If every route is used in both directions, this should give you an answer:

SELECT (x.passengers + y.passengers) AS passengers_sum, x.origin, x.dest
FROM yourTable x
JOIN yourTable y
ON x.origin = y.dest AND x.dest = y.origin
WHERE x.origin < x.dest  -- list each route once, not once per direction
ORDER BY passengers_sum DESC;

With indexes on your origin and dest columns, that self join should not worry you. I see no way to avoid an operation of that scale to get the requested result. You will have to add some kind of LIMIT to that statement if you only want the top X rows. I have no postgres experience on that.

I think SebastianH has it right. As a minor improvement, you could aggregate in a subquery and keep only the top three rows. PostgreSQL does not support the SELECT TOP clause, so use ORDER BY with LIMIT instead:

SELECT origin, dest, numpassengers
    FROM (SELECT A.ORIGIN, A.DEST, SUM(A.PASSENGERS + B.PASSENGERS) AS numpassengers
          FROM YOURTABLE A JOIN YOURTABLE B
            ON (A.ORIGIN = B.DEST AND A.DEST = B.ORIGIN)
          GROUP BY A.ORIGIN, A.DEST
         ) t
    ORDER BY numpassengers DESC
    LIMIT 3;
