
Type Mismatch Spark Scala

I am trying to create an empty DataFrame and use it in a function, but I keep getting the following error:

Required: DataFrame
Found: Dataset[DataFrame]

This is how I am doing it:

//Create empty DataFrame
val schema = StructType(
    StructField("g", StringType, true) ::
    StructField("tg", StringType, true) :: Nil)

var df1 = spark.createDataFrame(spark.sparkContext
      .emptyRDD[Row], schema)
//or
var df1 = spark.emptyDataFrame

Then I try to use it by calling a function, as you can see here:

  df1 = kvrdd1_toDF.map(x => function1(x, df1))

And this is the function:

  def function1(input: org.apache.spark.sql.Row, df: DataFrame): DataFrame = {
    val v1 = spark.sparkContext.parallelize(Seq("g","tg"))
    var df3 = v1.toDF("g","tg")
    if (df.take(1).isEmpty){
      df3 = Seq((input.get(2), "nn")).toDF("g", "tg")
    } else {
      df3 = df3.union(df)
    }
    df3
  }

What am I doing wrong?

You have a DataFrame, which is an alias for Dataset[Row]. You map each Row to a DataFrame, so that's how you end up with a Dataset[DataFrame]. I don't know what you are trying to do, but it will never work. The function (and all its dependencies) you use to map the contents of a Dataset is serialized and distributed over your Spark cluster. You can't use another DataFrame, a SparkSession, or a SparkContext inside such a function.
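If the goal is just to turn every input row into a ("g", "tg") record and end up with a single DataFrame, the per-row accumulation can be replaced by a plain column-level transformation, which runs on the executors and never needs a DataFrame inside map(). The sketch below is one possible rewrite under assumptions taken from the question: kvrdd1_toDF is stubbed with sample data, and the grouping value is assumed to sit in the third column (the x.get(2) in function1).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object DriverSideSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Stand-in for kvrdd1_toDF; the real column layout is an assumption here.
    val kvrdd1_toDF = Seq(("k1", "v1", "g1"), ("k2", "v2", "g2")).toDF("c0", "c1", "c2")

    // Build the ("g", "tg") pairs with a single select/withColumn instead of
    // constructing a new DataFrame per row inside map().
    val df1: DataFrame = kvrdd1_toDF
      .select($"c2".as("g"))
      .withColumn("tg", lit("nn"))

    df1.show()
    spark.stop()
  }
}
```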
