Spark Scala: using an external DataFrame inside a map
I have two dataframes:
val df1 = sqlContext.csvFile("/data/testData.csv")
val df2 = sqlContext.csvFile("/data/someValues.csv")
df1 =
startTime  name  cause1  cause2
15679      CCY   5       7
15683            2       5
15685            1       9
15690            9       6
df2 =
cause  description  causeType
3      Xxxxx        cause1
1      xxxxx        cause1
3      xxxxx        cause2
4      xxxxx
2      Xxxxx
and I want to apply a complex function getTimeCust to both cause1 and cause2 to determine a final cause, then match the description of this final cause code in df2. I need a new DataFrame (or RDD) with the following columns:
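(The question never shows getTimeCust's internals, so as a purely illustrative sketch, one hypothetical final-cause rule, assuming here that the larger of the two cause codes wins, might look like:)

```scala
// Hypothetical stand-in for getTimeCust's core decision. The real rules are
// elided in the question; this simply assumes the larger cause code wins.
def finalCause(cause1: Int, cause2: Int): Int =
  math.max(cause1, cause2)
```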
startTime name cause descriptionCause
My solution was:
val rdd2 = df1.map(row => {
  val (cause, descriptionCause) = getTimeCust(row.getInt(2), row.getInt(3), df2)
  Row(row(0), row(1), cause, descriptionCause)
})
If I run the code above I get a NullPointerException, because df2 is not visible inside the map.
The function getTimeCust(Int, Int, DataFrame) works well outside the map.
Use df1.join(df2, <join condition>) to join your dataframes together, then select the fields you need from the joined dataframe.
You can't use Spark's distributed structures (RDD, DataFrame, etc.) in code that runs on an executor (such as inside a map).
Try something like this:
import org.apache.spark.sql.functions.udf

// some logic to calculate the final cause
def f1(cause1: Int, cause2: Int): Int = ???

val causeUdf = udf(f1 _)
val dfCause = df1.withColumn("df1_cause", causeUdf($"cause1", $"cause2"))
val dfJoined = dfCause.join(df2, dfCause("df1_cause") === df2("cause"))
dfJoined.select("cause", "description").show()
Thank you @Assaf. Thanks to your answer and to the question spark udf with data frame, I have resolved this problem. The solution is:
val getTimeCust = udf((cause1: Any, cause2: Any) => {
  var lastCause = 0
  var categoryCause = ""
  var descCause = ""
  lastCause = .............
  categoryCause = ........
  (lastCause, categoryCause)
})
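(The UDF body above is elided; a hypothetical pure version of that logic, again only assuming the larger cause code wins and tagging which input column it came from to match df2's causeType values, could be:)

```scala
// Hypothetical core of the UDF: pick the larger cause code and record which
// input column ("cause1" or "cause2") supplied it, mirroring df2's causeType
// column. The real selection rules are elided in the original post.
def timeCust(cause1: Int, cause2: Int): (Int, String) =
  if (cause1 >= cause2) (cause1, "cause1") else (cause2, "cause2")
```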
and afterwards call the UDF as:
val dfWithCause = df1.withColumn("df1_cause", getTimeCust($"cause1", $"cause2"))
And finally the join:
val dfFinale = dfWithCause.join(df2,
  dfWithCause.col("df1_cause._1") === df2.col("cause") &&
  dfWithCause.col("df1_cause._2") === df2.col("causeType"),
  "outer")