
Spark Scala: Using an external variable “dataframe” in a map

I have two dataframes:

// csvFile is provided by the Databricks spark-csv package
// (import com.databricks.spark.csv._)
val df1 = sqlContext.csvFile("/data/testData.csv")
val df2 = sqlContext.csvFile("/data/someValues.csv")


df1 =
startTime  name  cause1  cause2
15679      CCY   5       7
15683            2       5
15685            1       9
15690            9       6

df2 =
cause  description  causeType
3      Xxxxx        cause1
1      xxxxx        cause1
3      xxxxx        cause2
4      xxxxx
2      Xxxxx

I want to apply a complex function getTimeCust to both cause1 and cause2 to determine a final cause, then look up the description of that final cause code in df2. The result must be a new DataFrame (or RDD) with the following columns:

startTime   name    cause   descriptionCause

My solution was:

  val rdd2 = df1.map(row => {
    val (cause, descriptionCause) = getTimeCust(row.getInt(2), row.getInt(3), df2)
    Row(row(0), row(1), cause, descriptionCause)
  })

If I run the code above I get a NullPointerException, because df2 is not visible inside the map.

The function getTimeCust(Int, Int, DataFrame) works well outside the map.

Use df1.join(df2, <join condition>) to join your dataframes together, then select the fields you need from the joined dataframe.

You can't use Spark's distributed structures (RDD, DataFrame, etc.) in code that runs on an executor (such as inside a map).
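
As an aside, if df2 is small enough to fit in driver memory, a common workaround is to collect it into a plain Scala Map and broadcast that to the executors. A minimal sketch, assuming df2's cause and description columns are read as strings and that getTimeCust can be rewritten to take the lookup map instead of the DataFrame:

  // Hypothetical sketch: collect the small lookup table to the driver,
  // then broadcast it so executor-side code can use a plain Map.
  val causeMap: Map[String, String] = df2.rdd
    .map(r => r.getString(0) -> r.getString(1)) // cause -> description
    .collect()
    .toMap
  val causeMapB = sc.broadcast(causeMap)

  val rdd2 = df1.map { row =>
    // assumes a variant of getTimeCust that takes the broadcast map
    val (cause, descriptionCause) = getTimeCust(row.getInt(2), row.getInt(3), causeMapB.value)
    Row(row(0), row(1), cause, descriptionCause)
  }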

Try something like this:

def f1(cause1: Int, cause2: Int): Int = ??? // some logic to calculate the cause

import org.apache.spark.sql.functions.udf
import sqlContext.implicits._ // for the $"col" syntax

val dfCause = df1.withColumn("df1_cause", udf(f1 _)($"cause1", $"cause2"))
val dfJoined = dfCause.join(df2, dfCause("df1_cause") === df2("cause"))
dfJoined.select("cause", "description").show()
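
To match the output columns requested in the question (startTime, name, cause, descriptionCause), the select could be extended, for example:

  dfJoined.select($"startTime", $"name", $"cause", $"description".as("descriptionCause")).show()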

Thank you @Assaf. Thanks to your answer and the spark udf with data frame post, I have resolved this problem. The solution is:

  val getTimeCust = udf((cause1: Any, cause2: Any) => {
    var lastCause = 0
    var categoryCause = ""
    var descCause = ""
    lastCause = .............
    categoryCause = ........

    (lastCause, categoryCause)
  })

and then call the udf as:

  val dfWithCause = df1.withColumn("df1_cause", getTimeCust($"cause1", $"cause2"))

And finally the join:

  val dfFinale = dfWithCause.join(df2,
    dfWithCause.col("df1_cause._1") === df2.col("cause") &&
    dfWithCause.col("df1_cause._2") === df2.col("causeType"),
    "outer")
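
A note on the syntax: because the udf returns a tuple, Spark stores the result as a struct column, which is why its fields are addressed as df1_cause._1 and df1_cause._2 in the join condition. A possible follow-up select to produce the columns requested in the question, assuming the names above:

  dfFinale.select($"startTime", $"name", $"df1_cause._1".as("cause"), $"description".as("descriptionCause")).show()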

