
How to Join Multiple Columns in Spark SQL using Java for filtering in DataFrame

  • DataFrame a contains columns x, y, z, k
  • DataFrame b contains columns x, y, a

    a.join(b, <condition in Java that uses x and y>) ???

I tried using:

    a.join(b, a.col("x").equalTo(b.col("x")) && a.col("y").equalTo(b.col("y")), "inner")

But Java throws an error saying && is not allowed.

Spark SQL provides a group of methods on Column, marked as java_expr_ops, which are designed for Java interoperability. It includes the and method (see also or), which can be used here:

a.col("x").equalTo(b.col("x")).and(a.col("y").equalTo(b.col("y"))

If you want to use multiple columns for the join, you can do something like this:

    a.join(b, scalaSeq, joinType)

You can store your join columns in a Java List and convert that List to a Scala Seq. Conversion of a Java List to a Scala Seq:

    Seq<String> scalaSeq = JavaConverters.asScalaIteratorConverter(list.iterator()).asScala().toSeq();

Example: a = a.join(b, scalaSeq, "inner");

Note: dynamic column lists are easy to support this way.
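A hedged end-to-end sketch of this second approach, reusing the a and b DataFrames from the earlier example (the joinColumns list and the joined variable are illustrative names, not from the original answer):

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import scala.collection.JavaConverters;
    import scala.collection.Seq;

    // Build the list of join columns at runtime (dynamic columns), then
    // convert it for Spark's join(Dataset, Seq<String>, String) overload.
    List<String> joinColumns = Arrays.asList("x", "y");
    Seq<String> scalaSeq = JavaConverters
            .asScalaIteratorConverter(joinColumns.iterator())
            .asScala()
            .toSeq();

    // Both sides must contain every column named in scalaSeq.
    Dataset<Row> joined = a.join(b, scalaSeq, "inner");
    joined.show();

One design difference worth knowing: this name-based overload keeps a single copy of each join column (x, y) in the result, whereas the Column-expression join above keeps both a.x and b.x.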
