Joining multiple tables in multiple ways

Question

I have 5 tables like the ones below:

Table A

rank	input
0	aa
1	bb
2	cc
3	dd

Table B

rank	input
0	aa
3	cc
4	dd
5	ee

Table C

rank	input
0	aa
5	ee
6	ff
7	gg

Table D

rank	input
0	aa
2	bb
6	ff
7	gg

I need the output to look like this:

Final table

rank	input
0	aa
2	bb
3	cc
5	ee
6	ff
7	gg

If I just cross join all the tables based on the biggest table, I get the output below:

rank	input
0	aa

Is there a way to get the output I want without having to do multiple joins across AB, BC, CD, BD, etc.?

Please let me know. I can use either SQL or PySpark to do this. Any suggestions would be appreciated.

Answer 1

You can union all the tables, group by input, and take the maximum rank for each input:

select max(`rank`) as `rank`, input
from (
    select * from tableA
    union all
    select * from tableB
    union all
    select * from tableC
    union all
    select * from tableD
) t
group by input
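
To check the query against the sample data from the question without a database server, here is a minimal sketch that runs it in an in-memory SQLite database through Python's standard library. The helper name max_rank_per_input and the TABLES dictionary are made up for this illustration, and the identifier is quoted as "rank" with double quotes rather than backticks; an `order by` is added only to make the printed order stable.

```python
import sqlite3

# Sample rows copied from Table A-D in the question.
TABLES = {
    "tableA": [(0, "aa"), (1, "bb"), (2, "cc"), (3, "dd")],
    "tableB": [(0, "aa"), (3, "cc"), (4, "dd"), (5, "ee")],
    "tableC": [(0, "aa"), (5, "ee"), (6, "ff"), (7, "gg")],
    "tableD": [(0, "aa"), (2, "bb"), (6, "ff"), (7, "gg")],
}


def max_rank_per_input(tables):
    """Union all tables and return (max rank, input) rows, ordered by rank."""
    conn = sqlite3.connect(":memory:")
    try:
        # Table names come from the hard-coded dict above, so building the
        # SQL text with f-strings is acceptable for this illustration only.
        for name, rows in tables.items():
            conn.execute(f'create table {name} ("rank" integer, input text)')
            conn.executemany(f"insert into {name} values (?, ?)", rows)

        union = " union all ".join(
            f'select "rank", input from {name}' for name in tables
        )
        query = (
            f'select max("rank") as "rank", input '
            f"from ({union}) t group by input order by 1"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


result = max_rank_per_input(TABLES)
for rank, value in result:
    print(rank, value)
```

On this data the query also returns dd with rank 4 (its highest rank, from Table B), a row that the expected output in the question does not list.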

In PySpark it would be:

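from functools import reduce
from pyspark.sql import functions as F

# Stack the four DataFrames, then keep the highest rank for each input.
df = reduce(lambda a, b: a.unionAll(b), [tableA, tableB, tableC, tableD])
result = df.groupBy('input').agg(F.max('rank').alias('rank'))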
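
The same union-then-aggregate logic can be traced in plain Python, which may help when working out expected results by hand before running a Spark job. This is only a sketch of the idea, not PySpark code: the lists stand in for the DataFrames, and the names table_a through table_d and max_rank_by_input are invented for the example.

```python
from functools import reduce

# Plain lists of (rank, input) tuples standing in for the four DataFrames.
table_a = [(0, "aa"), (1, "bb"), (2, "cc"), (3, "dd")]
table_b = [(0, "aa"), (3, "cc"), (4, "dd"), (5, "ee")]
table_c = [(0, "aa"), (5, "ee"), (6, "ff"), (7, "gg")]
table_d = [(0, "aa"), (2, "bb"), (6, "ff"), (7, "gg")]


def max_rank_by_input(rows):
    """Group rows by input and keep the highest rank seen for each one."""
    best = {}
    for rank, value in rows:
        if value not in best or rank > best[value]:
            best[value] = rank
    # Return (rank, input) tuples sorted by rank for a stable, readable order.
    return sorted((rank, value) for value, rank in best.items())


# Equivalent of the reduce/unionAll step: concatenate every table's rows.
all_rows = reduce(lambda a, b: a + b, [table_a, table_b, table_c, table_d])

# Equivalent of groupBy('input').agg(max('rank')).
result = max_rank_by_input(all_rows)
print(result)
```

As with unionAll, the concatenation keeps duplicate rows; the grouping step is what collapses them to one row per input.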

Solution 1
1 2021-04-14 12:09:49
