
Spark how to use a UDF with a Join

I'd like to use a specific UDF with Spark.

Here's the plan:

I have a table A (10 million rows) and a table B (15 million rows).

I'd like to use a UDF to compare one element of table A with one element of table B. Is that possible?

Here's a sample of my code. At some point I also need to require that the result of my UDF comparison be greater than 0.9:

DataFrame dfr = df
        .select("name", "firstname", "adress1", "city1", "compare(adress1,adress2)")
        .join(dfa, df.col("adress1").equalTo(dfa.col("adress2"))
                .and(df.col("city1").equalTo(dfa.col("city2"))
                        ...;

Is it possible?

Yes, you can. However, it will be slower than normal operators, as Spark will not be able to do predicate pushdown.

Example:

import org.apache.spark.sql.functions.udf

// use a name other than `udf` so it doesn't shadow the udf() builder
val similarity = udf((x: String, y: String) => { /* compute similarity here, returning a Double */ })
val df3 = df1.join(df2, similarity(df1("field1"), df2("field1")) > 0.9)
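Tied back to the question's schema, a minimal sketch could look like the following (the DataFrames df/dfa and the columns adress1/adress2, city1/city2 come from the question; the hand-rolled Levenshtein-based similarity is only an illustrative stand-in for whatever compare actually does):

import org.apache.spark.sql.functions.udf

// Plain Levenshtein edit distance (illustrative helper, not from the original post)
def levenshtein(a: String, b: String): Int = {
  val dp = Array.tabulate(a.length + 1, b.length + 1)((i, j) => if (i == 0) j else if (j == 0) i else 0)
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    dp(i)(j) = math.min(math.min(dp(i - 1)(j) + 1, dp(i)(j - 1) + 1), dp(i - 1)(j - 1) + cost)
  }
  dp(a.length)(b.length)
}

// Similarity in [0, 1]: 1.0 means identical strings
val similarity = udf((a: String, b: String) =>
  1.0 - levenshtein(a, b).toDouble / math.max(math.max(a.length, b.length), 1))

// Keep the cheap equality predicate (city) as a regular join key and apply
// the expensive UDF threshold on top of it
val dfr = df.join(dfa,
    df.col("city1").equalTo(dfa.col("city2"))
      .and(similarity(df.col("adress1"), dfa.col("adress2")) > 0.9))
  .select("name", "firstname", "adress1", "city1")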

For example:

val df1 = Seq (1, 2, 3, 4).toDF("x")
val df2 = Seq(1, 3, 7, 11).toDF("q")
val udf = org.apache.spark.sql.functions.udf((x : Int, q : Int) => { Math.abs(x - q); });
val df3 = df1.join(df2, udf(df1("x"), df2("q")) > 1)

You can also return a Boolean directly from the user-defined function, for example as shown below.
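A sketch of that variant (reusing the df1/df2/field1 names from the first example above; the trimmed, case-insensitive comparison is only a placeholder for a real above-0.9 similarity test):

import org.apache.spark.sql.functions.udf

// Placeholder predicate: replace the body with the real similarity-above-threshold check
val isSimilar = udf((a: String, b: String) => a.trim.equalsIgnoreCase(b.trim))

// Because the UDF already returns Boolean, it can be used directly as the join condition
val df3 = df1.join(df2, isSimilar(df1("field1"), df2("field1")))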
