
Spark SQL's Scala API - TimestampType - No Encoder found for org.apache.spark.sql.types.TimestampType

I am using Spark 2.1 with Scala 2.11 on a Databricks notebook.

What exactly is TimestampType?

We know from Spark SQL's documentation that the official timestamp type is TimestampType, which is apparently an alias for java.sql.Timestamp:

TimestampType can be found in Spark SQL's Scala API.
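For reference, a minimal sketch of where TimestampType lives (the typeName check below is an illustration, not from the original post):

import org.apache.spark.sql.types.{DataType, TimestampType}

val dt: DataType = TimestampType  // TimestampType is a singleton object in org.apache.spark.sql.types
dt.typeName                       // "timestamp"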

There is a difference between using a schema and using the Dataset API.

When parsing {"time":1469501297,"action":"Open"} from the Databricks Scala Structured Streaming example:

Using a JSON schema --> OK (I would prefer to use the elegant Dataset API):

import org.apache.spark.sql.types._

// Declare the schema explicitly: "time" is parsed as a timestamp
val jsonSchema = new StructType().add("time", TimestampType).add("action", StringType)

val staticInputDF =
  spark
    .read
    .schema(jsonSchema)
    .json(inputPath)
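As a quick sanity check (a sketch reusing staticInputDF from above), printing the schema confirms that "time" came in as a timestamp rather than a long:

staticInputDF.printSchema
// root
//  |-- time: timestamp (nullable = true)
//  |-- action: string (nullable = true)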

Using the Dataset API --> KO: No Encoder found for TimestampType.

Creating the Event class:

import org.apache.spark.sql.types._
case class Event(action: String, time: TimestampType)
--> defined class Event

Errors occur when reading the events from DBFS on Databricks.

Note: we don't get the error when using java.sql.Timestamp as the type for "time".

val path = "/databricks-datasets/structured-streaming/events/"
val events = spark.read.json(path).as[Event]  // fails: No Encoder found for TimestampType

Error message:

java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.types.TimestampType
- field (class: "org.apache.spark.sql.types.TimestampType", name: "time")
- root class: 
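The failure can be reproduced without reading any data, since encoder derivation happens as soon as the Dataset is typed. A minimal sketch (an illustration assuming a Spark 2.x REPL with the Event class above in scope):

import org.apache.spark.sql.Encoders

// Encoders.product[Event]  // throws UnsupportedOperationException: No Encoder found for TimestampType

// A java.sql.Timestamp field, by contrast, is natively supported:
case class GoodEvent(action: String, time: java.sql.Timestamp)
Encoders.product[GoodEvent]  // fine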

Combining the schema read method .schema(jsonSchema) with the as[Type] method, where the type contains java.sql.Timestamp, solves this issue. The idea came after reading the Structured Streaming documentation, Creating streaming DataFrames and streaming Datasets:

These examples generate streaming DataFrames that are untyped, meaning that the schema of the DataFrame is not checked at compile time, only checked at runtime when the query is submitted. Some operations like map, flatMap, etc. need the type to be known at compile time. To do those, you can convert these untyped streaming DataFrames to typed streaming Datasets using the same methods as static DataFrame.

import org.apache.spark.sql.types._

val path = "/databricks-datasets/structured-streaming/events/"

val jsonSchema = new StructType().add("time", TimestampType).add("action", StringType)

case class Event(action: String, time: java.sql.Timestamp)

val staticInputDS =
  spark
    .read
    .schema(jsonSchema)
    .json(path)
    .as[Event]

staticInputDS.printSchema

Will output:

root
 |-- time: timestamp (nullable = true)
 |-- action: string (nullable = true)
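With the typed Dataset in place, the compile-time-checked operations mentioned in the quoted documentation become available; a small usage sketch (the filter predicate is just an example):

// Typed operations such as filter and map are now checked at compile time
val opens = staticInputDS.filter(_.action == "Open")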

TimestampType is not an alias for java.sql.Timestamp, but rather a representation of a timestamp type for Spark's internal usage. In general, you don't want to use TimestampType in your code. The idea is that java.sql.Timestamp is supported natively by Spark SQL, so you can define your event class as follows:

case class Event(action: String, time: java.sql.Timestamp)

Internally, Spark will then use TimestampType to model the type of a value at runtime when compiling and optimizing your query, but this is not something you're interested in most of the time.
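A small sketch illustrating this split (assuming a notebook with a SparkSession named spark and spark.implicits._ imported; the sample value is made up): the case class field is a java.sql.Timestamp, while the schema of the resulting Dataset reports TimestampType:

import org.apache.spark.sql.types.TimestampType
import spark.implicits._

val ds = Seq(Event("Open", new java.sql.Timestamp(1469501297000L))).toDS()
ds.schema("time").dataType == TimestampType  // true: TimestampType lives at the schema level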
