
Scala Spark Structured Streaming Filter by TimestampType within Struct Field

I have multiple data types within a defined schema. I'm trying to find a good way to filter by TimestampType and convert all TimestampType fields from long to datetime. I am able to filter on StringType using .dtypes on the stream, but I'm running into issues filtering StructFields and StructTypes with .dtypes. Is there a way to filter only the TimestampTypes inside structs? Below is the pseudo code I am using; it runs on Scala 2.11 with Spark Structured Streaming.

val isoDateFormatter = "yyyy-MM-dd'T'HH:mm:ss'Z'"

val ExampleDataFrameLoad = spark
    .readStream
    .format("kafka")
    .option("subscribe", topics.keys.mkString(","))
    .options(kafkaConfig)
    .load()
    .select($"key".cast(StringType), $"value".cast(StringType), $"topic")
    // Convert untyped dataframe to dataset
    .as[(String, String, String)]
    // Merge all manifests for vehicle in minibatch
    .groupByKey(_._1)
    //Start of merge
    .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout)(mergeGroup)
    // .select($"key".cast(StringType),from_json($"value",schema).as("manifest"))
    .select($"_1".alias("key"), $"_2".alias("jsonvalues"))
    .select("key", "jsonvalues.*")

val ExampleDataFrame = ExampleDataFrameLoad
 ExampleDataFrame.dtypes.foreach(println)
/* Returns      
(key,StringType)
(contractVersion,StringType)
(metaData,StructType(StructField(Test,StringType,true), StructField(DateUtc,TimestampType,true)))
*/

Uses the following objects:

    import java.sql.Timestamp

    object ManifestClasses {

      final case class ProductManifestDocument(
                                                contractVersion: Option[String],
                                                metaData: DocumentMetaData
                                              )

      final case class DocumentMetaData(
                                         Test: Option[String],
                                         DateUtc: Timestamp
                                       )
    }
 

ExampleDataFrame
    // brings back data fields with types
    .dtypes
    // Currently returning empty, but works for StringType
    .filter(_._2 == "TimestampType")
    .map(_._1)
    // Transforms all timestamp longs to yyyy-MM-dd'T'HH:mm:ss'Z' format
    .foldLeft(ExampleDataFrame)((df, colName) => df.withColumn(colName, date_format(col(colName), isoDateFormatter)))
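
Presumably the filter above comes back empty because .dtypes only reports top-level columns, so the nested TimestampType is folded into the metaData struct's string representation and the equality check never matches. A minimal sketch illustrating this, assuming the schema printed above:

    // .dtypes reports only top-level columns, so the nested DateUtc timestamp
    // never appears as a standalone (name, "TimestampType") pair.
    import org.apache.spark.sql.types.{StructField, StructType}

    ExampleDataFrame.dtypes.filter(_._2 == "TimestampType")   // Array() - nothing matches
    // Nested fields are only reachable through the schema itself:
    ExampleDataFrame.schema
      .collect { case StructField(name, st: StructType, _, _) => name -> st.fieldNames.toList }
      .foreach(println)                                        // (metaData,List(Test, DateUtc))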

You can transform the structType as below -

Change the date format of the TimestampType in place

Load the test data

val ExampleDataFrame = spark.sql("select key, contractVersion, metaData from values " +
      "('k1', 'v1', named_struct('Test', 'test1', 'DateUtc', cast(unix_timestamp() as timestamp))) " +
      "T(key, contractVersion, metaData)")
    ExampleDataFrame.show(false)
    ExampleDataFrame.printSchema()
    ExampleDataFrame.dtypes.foreach(println)
    /**
      * +---+---------------+----------------------------+
      * |key|contractVersion|metaData                    |
      * +---+---------------+----------------------------+
      * |k1 |v1             |[test1, 2020-06-23 14:39:55]|
      * +---+---------------+----------------------------+
      *
      * root
      * |-- key: string (nullable = false)
      * |-- contractVersion: string (nullable = false)
      * |-- metaData: struct (nullable = false)
      * |    |-- Test: string (nullable = false)
      * |    |-- DateUtc: timestamp (nullable = true)
      *
      * (key,StringType)
      * (contractVersion,StringType)
      * (metaData,StructType(StructField(Test,StringType,false), StructField(DateUtc,TimestampType,true)))
      */

Convert the timestamp from the struct to a specific date format


    val isoDateFormatter = "yyyy-MM-dd'T'HH:mm:ss'Z'"
    val processedDF = ExampleDataFrame.withColumn("metaData", struct($"metaData.Test",
      date_format($"metaData.DateUtc", isoDateFormatter)))
      processedDF.show(false)

    /**
      * +---+---------------+-----------------------------+
      * |key|contractVersion|metaData                     |
      * +---+---------------+-----------------------------+
      * |k1 |v1             |[test1, 2020-06-23T14:51:17Z]|
      * +---+---------------+-----------------------------+
      */
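
Note that struct() above names its second member after the generated expression rather than DateUtc. If the original field names inside metaData need to be preserved, a small assumed variant is to alias each member explicitly (processedNamedDF is just an illustrative name):

    // Assumed variant of the step above: alias each member so the rebuilt metaData
    // struct keeps its original field names (Test, DateUtc) after formatting.
    val processedNamedDF = ExampleDataFrame.withColumn("metaData", struct(
      $"metaData.Test".as("Test"),
      date_format($"metaData.DateUtc", isoDateFormatter).as("DateUtc")))
    processedNamedDF.printSchema() // metaData keeps fields Test and DateUtc; DateUtc is now a string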

Update-1 (based on comments)

Extract the TimestampType from the structType as a separate column

ExampleDataFrame.schema
      .filter(_.dataType.isInstanceOf[StructType])
      .flatMap(s => s.dataType.asInstanceOf[StructType]
        .filter(_.dataType == TimestampType)
        .map(f => s"${s.name}.${f.name}")
      )
      .foldLeft(ExampleDataFrame)((df, colName) => df.withColumn(colName, date_format(col(colName), isoDateFormatter)))
      .show(false)

    /**
      * +---+---------------+----------------------------+--------------------+
      * |key|contractVersion|metaData                    |metaData.DateUtc    |
      * +---+---------------+----------------------------+--------------------+
      * |k1 |v1             |[test1, 2020-06-23 21:40:36]|2020-06-23T21:40:36Z|
      * +---+---------------+----------------------------+--------------------+
      */
// use df.select("`metaData.DateUtc`") to select the columns having dot(.)
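
If the goal is to replace the nested timestamp in place rather than add a separate dotted column, a possible sketch along the same lines (reformattedDF is an illustrative name, and only one level of struct nesting is assumed) is to rebuild each struct column from the schema:

    // Assumed variant: rebuild every struct column so the formatted string replaces
    // the original TimestampType field in place instead of adding a dotted column.
    import org.apache.spark.sql.functions.{col, date_format, struct}
    import org.apache.spark.sql.types.{StructType, TimestampType}

    val reformattedDF = ExampleDataFrame.schema
      .filter(_.dataType.isInstanceOf[StructType])
      .foldLeft(ExampleDataFrame) { (df, structCol) =>
        val members = structCol.dataType.asInstanceOf[StructType].map { f =>
          val child = col(s"${structCol.name}.${f.name}")
          if (f.dataType == TimestampType) date_format(child, isoDateFormatter).as(f.name)
          else child.as(f.name)
        }
        df.withColumn(structCol.name, struct(members: _*))
      }
    reformattedDF.show(false)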
