Spark: how to use a UDF with a join
I'd like to use a specific UDF with Spark.
Here's the plan: I have a table A (10 million rows) and a table B (15 million rows).
I'd like to use a UDF that compares one element of table A with one element of table B. Is it possible?
Here's a sample of my code. At some point I also need to say that the result of my UDF compare must be greater than 0.9:
DataFrame dfr = df
.select("name", "firstname", "adress1", "city1","compare(adress1,adress2)")
.join(dfa,df.col("adress1").equalTo(dfa.col("adress2"))
.and((df.col("city1").equalTo(dfa.col("city2"))
...;
Is it possible?
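Yes, you can. However, it will be slower than the built-in operators, because a UDF is opaque to Spark and it cannot apply optimizations such as predicate pushdown. If the UDF is the only join condition, Spark also has no equality key to join on and has to evaluate the UDF for every pair of rows.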
Example:
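import org.apache.spark.sql.functions.udf
// Placeholder body: compute the similarity of x and y here and return it as a Double.
val similarity = udf((x: String, y: String) => { 0.0 })
val df3 = df1.join(df2, similarity(df1("field1"), df2("field1")) > 0.9)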
For example:
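import spark.implicits._ // for toDF; assumes a SparkSession named spark, as in spark-shell
val df1 = Seq(1, 2, 3, 4).toDF("x")
val df2 = Seq(1, 3, 7, 11).toDF("q")
// Absolute difference between the two values
val absDiff = org.apache.spark.sql.functions.udf((x: Int, q: Int) => Math.abs(x - q))
// Keep only the pairs whose difference is greater than 1
val df3 = df1.join(df2, absDiff(df1("x"), df2("q")) > 1)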
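With tables of 10 and 15 million rows the slowdown matters, so it may help to keep an equality condition in the join, as in the question, and apply the UDF threshold on top of it. Spark can then join on the equality key and only run the UDF on rows that already match. Below is a minimal sketch of that idea, assuming a local SparkSession. The column names follow the question, but the sample rows and the `score` function (a toy position-by-position character match) are made up for illustration and are not the asker's real comparison:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfJoinWithKeys {
  // Hypothetical stand-in for the real comparison: the share of positions
  // at which both strings hold the same character, between 0.0 and 1.0.
  // It does not handle null inputs.
  def score(a: String, b: String): Double = {
    val longest = math.max(a.length, b.length)
    if (longest == 0) 1.0
    else a.zip(b).count { case (x, y) => x == y }.toDouble / longest
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("udf-join-with-keys")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows using the question's column names.
    val df = Seq(
      ("Doe", "12 Main Street", "Paris"),
      ("Roe", "5 High Road", "Lyon")
    ).toDF("name", "adress1", "city1")
    val dfa = Seq(
      ("12 Main Streat", "Paris"),
      ("99 Other Avenue", "Lyon")
    ).toDF("adress2", "city2")

    val compare = udf((a: String, b: String) => score(a, b))

    // Equality on the city plus the UDF threshold on the addresses.
    val joined = df.join(
      dfa,
      df("city1") === dfa("city2") && compare(df("adress1"), dfa("adress2")) > 0.9
    )

    joined.explain() // inspect which join strategy Spark picked
    joined.show()
    spark.stop()
  }
}
```

Comparing the output of `explain()` with and without the equality condition is a quick way to see whether Spark still has a key to join on.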
You can also return a Boolean directly from the user-defined function.
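A Boolean-returning UDF could look roughly like this. It is only a sketch, assuming a local SparkSession and reusing the x and q sample data from the example above; the name `farApart` is made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object BooleanUdfJoin {
  // The threshold check lives inside the function, so it returns a Boolean.
  def farApart(x: Int, q: Int): Boolean = math.abs(x - q) > 1

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("boolean-udf-join")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq(1, 2, 3, 4).toDF("x")
    val df2 = Seq(1, 3, 7, 11).toDF("q")

    val farApartUdf = udf((x: Int, q: Int) => farApart(x, q))

    // The UDF result is already a Boolean column, so it is the join condition itself.
    val df3 = df1.join(df2, farApartUdf(df1("x"), df2("q")))
    df3.show()
    spark.stop()
  }
}
```

This keeps the threshold next to the comparison logic, at the cost of not being able to change it from the join expression.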