Order by and Join in SQL or Spark or MapReduce
I have two tables whose content is shown below.
Table 1:
ID1 ID2 ID3 ID4 NAME DESCR STATUS date
1 -12134 17773 8001300701101 name1 descr1 INACTIVE 20121203
2 -12136 17773 8001300701101 name1 descr1 INACTIVE 20121202
3 -12138 17785 9100000161822 name3 descr3 INACTIVE 20121201
4 -12140 17785 9100000161822 name3 descr3 ACTIVE 20121130
5 -12142 17787 8000500039106 name4 descr4 ACTIVE 20121129
Table2:
ID1 ID2 ID3 ID4 NAME DESCR
0 17781 17773 8001300701101 name1 descr1
0 17783 17783 8001300060109 name2 descr2
0 17785 17785 9100000161822 name3 descr3
0 17787 17787 8000500039106 name4 descr4
0 17789 17789 0000080052364 name5 descr5
I am trying to get the result below.
ID3 ID4 NAME DESCR STATUS date
17773 8001300701101 name1 descr1 INACTIVE 20121202
17783 8001300060109 name2 descr2 NULL NULL
17785 9100000161822 name3 descr3 ACTIVE 20121201
17787 8000500039106 name4 descr4 ACTIVE 20121129
17789 0000080052364 name5 descr5 NULL NULL
As per the above input and output, the two tables should be joined on columns id3, id4, name, and descr. If an active record exists, it should be returned; but if only inactive records exist, then the oldest inactive record should be joined.
I have tried different queries, but none of them comes close to the answer I want. The four join columns are all non-primary fields, but they are never null. There can be a one-to-many or many-to-many relationship between the two tables.
I am working on Apache Phoenix, but a solution in Hadoop MapReduce or Apache Spark is also OK.
A sample query I have written is as follows.
Select table2.*, table1.status, table1.date
From table1 Right outer join table2 on table1.id3 = table2.id3
and table1.id4 = table2.id4
and table1.name = table2.name
and table1.descr = table2.descr
Order by (status) and order by (date)
Can anyone help me, please?
You cannot do a straight join against Table 1. Instead, you have to join against multiple queries of Table 1, which are themselves joined together. By my count, you are going to have to do:
- min(date) for ACTIVE records in table 1, per ID3, ID4, etc.
- min(date) for INACTIVE records in table 1
- coalesce to select the ACTIVE versus INACTIVE fields.
Something like this:
import org.apache.spark.sql.functions.{coalesce, min}

// Oldest date per (ID3, ID4, NAME, DESCR) group, ACTIVE rows only.
// Note: the grouping columns come through automatically; only the
// aggregate goes in agg().
val activeMin = table1.filter(
  $"STATUS" === "ACTIVE"
).groupBy(
  $"ID3", $"ID4", $"NAME", $"DESCR", $"STATUS"
).agg(
  min($"date") as "date"
)

// Same aggregation for INACTIVE rows, with columns renamed so the
// two sides can be told apart after the join.
val inactiveMin = table1.filter(
  $"STATUS" === "INACTIVE"
).groupBy(
  $"ID3", $"ID4", $"NAME", $"DESCR", $"STATUS"
).agg(
  min($"date") as "date"
).select(
  $"ID3" as "ID3r", $"ID4" as "ID4r", $"NAME" as "NAMEr", $"DESCR" as "DESCRr",
  $"STATUS" as "STATUSr", $"date" as "dater"
)

// Full outer join of the two aggregates; coalesce prefers the ACTIVE
// side whenever it exists, falling back to the oldest INACTIVE row.
val cookedTable1 = activeMin.join(
  inactiveMin,
  $"ID3" === $"ID3r" && $"ID4" === $"ID4r" && $"NAME" === $"NAMEr" && $"DESCR" === $"DESCRr",
  "full_outer"
).select(
  coalesce($"ID3", $"ID3r") as "ID3",
  coalesce($"ID4", $"ID4r") as "ID4",
  coalesce($"NAME", $"NAMEr") as "NAME",
  coalesce($"DESCR", $"DESCRr") as "DESCR",
  coalesce($"STATUS", $"STATUSr") as "STATUS",
  coalesce($"date", $"dater") as "date"
)
Given what you have in your table 1 above, the result would look like:
cookedTable1.show
ID3 ID4 NAME DESCR STATUS date
17785 9100000161822 name3 descr3 ACTIVE 20121130
17787 8000500039106 name4 descr4 ACTIVE 20121129
17773 8001300701101 name1 descr1 INACTIVE 20121202
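As a side note, if your Spark version supports window functions, the same ACTIVE-else-oldest-INACTIVE selection can be expressed in a single pass. This is only a sketch under assumptions (a context with window-function support and the same table1 column names as above), not the method the answer used:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, when}

// Rank rows within each business key: ACTIVE rows sort before INACTIVE
// ones, and earlier dates sort first. Keeping only rank 1 per group
// reproduces the "ACTIVE if present, else oldest INACTIVE" rule.
val w = Window
  .partitionBy($"ID3", $"ID4", $"NAME", $"DESCR")
  .orderBy(when($"STATUS" === "ACTIVE", 0).otherwise(1), $"date")

val cookedAlt = table1
  .withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .drop("rn")
```

Either form produces one row per (ID3, ID4, NAME, DESCR) key, so the subsequent right outer join against table2 behaves the same.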
Now, using cookedTable1 in place of table1, run the same query you did before:
cookedTable1.registerTempTable("cookedTable1")
val results = sqlContext.sql("""
  Select table2.*, cookedTable1.status, cookedTable1.date
  From cookedTable1 Right outer join table2 on cookedTable1.id3 = table2.id3
  and cookedTable1.id4 = table2.id4
  and cookedTable1.name = table2.name
  and cookedTable1.descr = table2.descr
""")
This should get you the results you were originally looking for.
I can only speak for Spark. The query appears correct in terms of the right outer join and the four join columns.
In Spark (and, AFAIK, in ANSI SQL) the order by is not written the way you show, but instead:
order by status, date
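In the DataFrame API the equivalent is a single orderBy call with both columns; a minimal sketch, assuming the joined result is held in a DataFrame named results:

```scala
// Sort by status first, then by date within equal statuses (both ascending)
val ordered = results.orderBy($"status", $"date")
```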