
Conditional join, Spark Scala (between limits)

I have two dataframes.

df1:

Team, Sport, CostTicket
Stars, Football, 10
Circles, Football, 20
Stars, Basket, 12
Stars, Baseball, 14
Circles, Baseball, 25

and

df2:

Team, Sport, CostRange, LowerLimit, UpperLimit
Stars, Football, 0<3, 0, 3
Stars, Football, 4<10, 4, 10
Stars, Football, 11<22, 11, 22
Stars, Football, 24<25, 24, 25
Circles, Football, 0<4, 0, 4
Circles, Football, 5<10, 5, 10
Circles, Football, 11<20, 11, 20
Circles, Football, 21<30, 21, 30
Stars, Basket, 0<2, 0, 2
Stars, Basket, 3<7, 3, 7
Stars, Basket, 8<19, 8, 19
Stars, Basket, 20<30, 20, 30
Circles, Basket, 0<1, 0, 1
Circles, Basket, 2<4, 2, 4
Circles, Basket, 5<15, 5, 15
Circles, Basket, 16<30, 16, 30
Stars, Baseball, 0<10, 0, 10
Stars, Baseball, 11<20, 11, 20
Stars, Baseball, 21<30, 21, 30
Circles, Baseball, 0<4, 0, 4
Circles, Baseball, 5<10, 5, 10
Circles, Baseball, 11<20, 11, 20
Circles, Baseball, 21<30, 21, 30

I want to add a fourth column to df1 containing the matching CostRange from df2, that is, the range whose limits enclose the CostTicket for the same Team and Sport.

The final result should be:

Team, Sport, CostTicket, Range
Stars, Football, 10, 4<10
Circles, Football, 20, 11<20
Stars, Basket, 12, 8<19
Stars, Baseball, 14, 11<20
Circles, Baseball, 25, 21<30

I have come this far, but it does not work. Can someone help me with this?

val df1 = df2.withColumn("Range", df2("CostRange"))
  .where(df1("CostTicket") > df2("LowerLimit"))
  .where(df1("CostTicket") < df2("UpperLimit"))
  .where(df1("Team") === df2("Team"))
  .where(df1("Sport") === df2("Sport"))

You can't reference columns from two different dataframes in a single expression; you need to join the two dataframes first.

You can join on the two key columns first and then filter with where, as below:

// The $"..." column syntax needs import spark.implicits._ (already in scope in spark-shell)
df1.join(df2, Seq("Team", "Sport"))
    .where($"CostTicket" >= $"LowerLimit" && $"CostTicket" <= $"UpperLimit")

Or you could put the range check in the join condition itself, as below:

df1.join(df2,
    df1("Team") === df2("Team") &&
    df1("Sport") === df2("Sport") &&
    df1("CostTicket") >= df2("LowerLimit") &&
    df1("CostTicket") <= df2("UpperLimit")
).drop(df2("Team"))
 .drop(df2("Sport"))

Output:

+-------+--------+----------+---------+----------+----------+
|Team   |Sport   |CostTicket|CostRange|LowerLimit|UpperLimit|
+-------+--------+----------+---------+----------+----------+
|Stars  |Football|10        |4<10     |4         |10        |
|Circles|Football|20        |11<20    |11        |20        |
|Stars  |Basket  |12        |8<19     |8         |19        |
|Stars  |Baseball|14        |11<20    |11        |20        |
|Circles|Baseball|25        |21<30    |21        |30        |
+-------+--------+----------+---------+----------+----------+
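
If only the four columns from the question are wanted, the join can be followed by a select that renames CostRange to Range. The following is a minimal, self-contained sketch of that idea, not a definitive implementation. It assumes a local SparkSession (the app name and master setting are placeholders), uses a small subset of the question's data with Sport spelled identically in both frames (which the equality join requires), uses `between` as an inclusive alternative to the `>=` / `<=` pair, and uses a left join, which is an added choice so that tickets falling in no range are kept with a null Range.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation only; app name and master are assumptions.
val spark = SparkSession.builder()
  .appName("range-join-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Subset of the question's data, with Sport spelled the same in both frames.
val df1 = Seq(
  ("Stars", "Football", 10),
  ("Circles", "Football", 20),
  ("Stars", "Basket", 12)
).toDF("Team", "Sport", "CostTicket")

val df2 = Seq(
  ("Stars", "Football", "4<10", 4, 10),
  ("Stars", "Football", "11<22", 11, 22),
  ("Circles", "Football", "11<20", 11, 20),
  ("Circles", "Football", "21<30", 21, 30),
  ("Stars", "Basket", "8<19", 8, 19)
).toDF("Team", "Sport", "CostRange", "LowerLimit", "UpperLimit")

// Match on Team and Sport, with the range check inside the join condition.
// between(lower, upper) is inclusive on both ends, like >= and <=.
val result = df1.join(
    df2,
    df1("Team") === df2("Team") &&
      df1("Sport") === df2("Sport") &&
      df1("CostTicket").between(df2("LowerLimit"), df2("UpperLimit")),
    "left" // keep df1 rows that match no range; Range is null for those
  )
  .select(df1("Team"), df1("Sport"), df1("CostTicket"), df2("CostRange").as("Range"))

result.show(false)
```

Note that if the ranges in df2 overlap for a given Team and Sport, a ticket can match more than one row and will appear once per matching range.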
