
Error: Unable to find encoder for type org.apache.spark.sql.Dataset[(String, Long)]

The following test for Dataset comparison is failing with this error:

Error:(55, 38) Unable to find encoder for type org.apache.spark.sql.Dataset[(String, Long)]. An implicit Encoder[org.apache.spark.sql.Dataset[(String, Long)]] is needed to store org.apache.spark.sql.Dataset[(String, Long)] instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.
    ).toDF("lower(word)", "count").as[Dataset[(String, Long)]]
Error:(55, 38) not enough arguments for method as: (implicit evidence$2: org.apache.spark.sql.Encoder[org.apache.spark.sql.Dataset[(String, Long)]])org.apache.spark.sql.Dataset[org.apache.spark.sql.Dataset[(String, Long)]].
Unspecified value parameter evidence$2.
    ).toDF("lower(word)", "count").as[Dataset[(String, Long)]]

Test

As you can see, I tried creating a Kryo Encoder for (String, Long):

class WordCountDSAppTestSpec extends FlatSpec with SparkSessionTestWrapper with DatasetComparer {

  import spark.implicits._

  "countWords" should "return count of each word" in {

    val wordsDF = Seq(
      ("one", "one"),
      ("two", "two"),
      ("three Three", "three"),
      ("three Three", "Three"),
      ("", "")
    ).toDF("line", "word").as[LineAndWord]

    implicit val tupleEncoder = org.apache.spark.sql.Encoders.kryo[(String, Long)]
    val expectedDF = Seq(
      ("one", 1L),
      ("two", 1L),
      ("three", 2L)
    ).toDF("lower(word)", "count").as[Dataset[(String, Long)]]

    val actualDF = WordCountDSApp.countWords(wordsDF)

    assertSmallDatasetEquality(actualDF, expectedDF, orderedComparison = false)
  }
}

Spark App under test

import com.aravind.oss.Logging
import com.aravind.oss.eg.wordcount.spark.WordCountUtil.{WhitespaceRegex, getClusterCfg, getPaths, getSparkSession}
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{explode, split}

object WordCountDSApp extends App with Logging {
  logInfo("WordCount with Dataset API and multiple Case classes")
  val paths = getPaths(args)
  val cluster = getClusterCfg(args)

  if (paths.size > 1) {
    logInfo("More than one file to process")
  }
  logInfo("Path(s): " + paths)
  logInfo("Cluster: " + cluster)

  val spark = getSparkSession("WordCountDSApp", cluster)

  import spark.implicits._

  /*
  case class Line SHOULD match the number of columns in the input file
   */
  val linesDs: Dataset[Line] = spark.read
    .textFile(paths: _*)
    .toDF("line")
    .as[Line]
  logInfo("Dataset before splitting line")
  linesDs.show(false)

  /*
  toWords adds an additional column (word) to the output, so we need a
  new case class LineAndWord that contains two properties to represent the two columns.
  The names of the properties should match the name of the columns as well.
   */
  val wordDs: Dataset[LineAndWord] = toWords(linesDs)
  logInfo("Dataset after splitting the line into words")
  wordDs.show(false)

  val wordCount = countWords(wordDs)
  wordCount
    .orderBy($"count(1)".desc)
    .show(false)

  def toWords(linesDs: Dataset[Line]): Dataset[LineAndWord] = {
    import linesDs.sparkSession.implicits._
    linesDs
      .select($"line",
        explode(split($"line", WhitespaceRegex)).as("word"))
      .as[LineAndWord]
  }

  def countWords(wordsDs: Dataset[LineAndWord]): Dataset[(String, Long)] = {
    import wordsDs.sparkSession.implicits._
    val result = wordsDs
      .filter(_.word != null)
      .filter(!_.word.isEmpty)
      .groupByKey(_.word.toLowerCase)
      .count()

    result
  }

  case class Line(line: String)

  case class LineAndWord(line: String, word: String)

}

You should call .as[Something], not .as[Dataset[Something]]. Here is a working version:


"countWords" should "return count of each word" in {
  import org.apache.spark.sql.{Encoder, Encoders}
  import spark.implicits._
  implicit def tuple2[A1, A2](implicit e1: Encoder[A1],
                              e2: Encoder[A2]): Encoder[(A1, A2)] =
    Encoders.tuple[A1, A2](e1, e2)

  val expectedDF = Seq(("one", 1L), ("two", 1L), ("three", 2L))
    .toDF("value", "count(1)")
    .as[(String, Long)]

  val wordsDF1 = Seq(
    ("one", "one"),
    ("two", "two"),
    ("three Three", "three"),
    ("three Three", "Three"),
    ("", "")
  ).toDF("line", "word").as[LineAndWord]

  val actualDF = WordCountDSApp.countWords(wordsDF1)
  actualDF.show()
  expectedDF.show()

  assertSmallDatasetEquality(actualDF, expectedDF, orderedComparison = false)
}
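
For reference, the only change the original test actually needs is the .as[...] call itself; the Kryo encoder can be dropped, since spark.implicits._ already provides an Encoder for tuples of supported types. A minimal sketch of the corrected expectedDF (the column names value and count(1) follow the schema produced by groupByKey(...).count(), as used in the working version above):

  // In the original test, replace the Kryo encoder and the nested .as[...] with:
  val expectedDF = Seq(
    ("one", 1L),
    ("two", 1L),
    ("three", 2L)
  ).toDF("value", "count(1)") // match the columns countWords actually returns
    .as[(String, Long)]       // element type, not Dataset[element type]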
