
Spark Scala: Using an external variable “dataframe” in a map

I have two dataframes:

// csvFile is provided by the Databricks spark-csv package
// (import com.databricks.spark.csv._)
val df1 = sqlContext.csvFile("/data/testData.csv")
val df2 = sqlContext.csvFile("/data/someValues.csv")


df1 =
startTime  name  cause1  cause2
15679      CCY   5       7
15683            2       5
15685            1       9
15690            9       6

df2 =
cause  description  causeType
3      Xxxxx        cause1
1      xxxxx        cause1
3      xxxxx        cause2
4      xxxxx
2      Xxxxx

I want to apply a complex function getTimeCust to both cause1 and cause2 to determine a final cause, then look up the description of that final cause code in df2. The result must be a new DataFrame (or RDD) with the following columns:

startTime   name    cause   descriptionCause

My solution was:

  val rdd2 = df1.map(row => {
    val (cause, descriptionCause) = getTimeCust(row.getInt(2), row.getInt(3), df2)
    Row(row(0), row(1), cause, descriptionCause)
  })

If I run the code above I get a NullPointerException, because df2 is not visible inside the map.

The function getTimeCust(Int, Int, DataFrame) works well outside the map.

Use df1.join(df2, <join condition>) to join your dataframes together, then select the fields you need from the joined dataframe.

You can't use Spark's distributed structures (RDD, DataFrame, etc.) in code that runs on an executor (such as inside a map).
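
As an aside, if df2 is small enough to fit in driver memory, a common workaround is to collect it into a plain Scala Map and broadcast that to the executors. A minimal sketch, assuming df2's cause and description columns are read as strings and that getTimeCust can be rewritten to take the lookup map instead of the DataFrame:

  // Hypothetical sketch: collect the small lookup table to the driver,
  // then broadcast it so executor-side code can use a plain Map.
  val causeMap: Map[String, String] = df2.rdd
    .map(r => r.getString(0) -> r.getString(1)) // cause -> description
    .collect()
    .toMap
  val causeMapB = sc.broadcast(causeMap)

  val rdd2 = df1.map { row =>
    // assumes a variant of getTimeCust that takes the broadcast map
    val (cause, descriptionCause) = getTimeCust(row.getInt(2), row.getInt(3), causeMapB.value)
    Row(row(0), row(1), cause, descriptionCause)
  }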

Try something like this:

def f1(cause1: Int, cause2: Int): Int = ??? // some logic to calculate the cause

import org.apache.spark.sql.functions.udf
import sqlContext.implicits._ // for the $"col" syntax

val dfCause = df1.withColumn("df1_cause", udf(f1 _)($"cause1", $"cause2"))
val dfJoined = dfCause.join(df2, dfCause("df1_cause") === df2("cause"))
dfJoined.select("cause", "description").show()
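
To match the output columns requested in the question (startTime, name, cause, descriptionCause), the select could be extended, for example:

  dfJoined.select($"startTime", $"name", $"cause", $"description".as("descriptionCause")).show()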

Thank you @Assaf. Thanks to your answer and the spark udf with data frame post, I have resolved this problem. The solution is:

  val getTimeCust = udf((cause1: Any, cause2: Any) => {
    var lastCause = 0
    var categoryCause = ""
    var descCause = ""
    lastCause = .............
    categoryCause = ........

    (lastCause, categoryCause)
  })

and then call the udf as:

  val dfWithCause = df1.withColumn("df1_cause", getTimeCust($"cause1", $"cause2"))

And finally the join:

  val dfFinale = dfWithCause.join(df2,
    dfWithCause.col("df1_cause._1") === df2.col("cause") &&
    dfWithCause.col("df1_cause._2") === df2.col("causeType"),
    "outer")
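
A note on the syntax: because the udf returns a tuple, Spark stores the result as a struct column, which is why its fields are addressed as df1_cause._1 and df1_cause._2 in the join condition. A possible follow-up select to produce the columns requested in the question, assuming the names above:

  dfFinale.select($"startTime", $"name", $"df1_cause._1".as("cause"), $"description".as("descriptionCause")).show()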

