Scala Spark Structured Streaming Filter by TimestampType within Struct Field

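I have multiple data types in a defined schema. I am trying to find a good way to filter by TimestampType so that I can convert all TimestampType fields from long to datetime. I am able to filter on StringType in the stream using .dtypes, but I run into problems when trying to use .dtypes to filter StructFields and StructTypes. Is there a way to filter only the timestamp types inside a struct? Below is the pseudocode I am using, which runs Spark Structured Streaming on Scala 2.11.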

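import org.apache.spark.sql.functions.{col, date_format}
import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode}
import org.apache.spark.sql.types.StringType
import spark.implicits._

val isoDateFormatter = "yyyy-MM-dd'T'HH:mm:ss'Z'"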

val ExampleDataFrameLoad = spark
    .readStream
    .format("kafka")
    .option("subscribe", topics.keys.mkString(","))
    .options(kafkaConfig)
    .load()
    .select($"key".cast(StringType), $"value".cast(StringType), $"topic")
    // Convert untyped dataframe to dataset
    .as[(String, String, String)]
    // Merge all manifests for vehicle in minibatch
    .groupByKey(_._1)
    // Start of merge
    .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout)(mergeGroup)
    // .select($"key".cast(StringType),from_json($"value",schema).as("manifest"))
    .select($"_1".alias("key"), $"_2".alias("jsonvalues"))
    .select("key", "jsonvalues.*")

val ExampleDataFrame = ExampleDataFrameLoad
ExampleDataFrame.dtypes.foreach(println)
/* Returns
(key,StringType)
(contractVersion,StringType)
(metaData,StructType(StructField(Test,StringType,true), StructField(DateUtc,TimestampType,true)))
*/

// Uses the following objects
import java.sql.Timestamp

object ManifestClasses {

  final case class ProductManifestDocument(
    contractVersion: Option[String],
    metaData: DocumentMetaData
  )

  final case class DocumentMetaData(
    Test: Option[String],
    DateUtc: Timestamp
  )
}
 

ExampleDataFrame
    // Returns (column name, type name) pairs for the top-level columns
    .dtypes
    // Currently returns empty, although the same filter works for "StringType"
    .filter(_._2 == "TimestampType")
    .map(_._1)
    // Transforms all timestamp longs to the yyyy-MM-dd'T'HH:mm:ss'Z' format
    .foldLeft(ExampleDataFrame)((df, colName) => df.withColumn(colName, date_format(col(colName), isoDateFormatter)))

You can convert the StructType as follows:

Change the date format of the timestamp type in place

Load the test data

val ExampleDataFrame = spark.sql("select key, contractVersion, metaData from values " +
      "('k1', 'v1', named_struct('Test', 'test1', 'DateUtc', cast(unix_timestamp() as timestamp))) " +
      "T(key, contractVersion, metaData)")
    ExampleDataFrame.show(false)
    ExampleDataFrame.printSchema()
    ExampleDataFrame.dtypes.foreach(println)
    /**
      * +---+---------------+----------------------------+
      * |key|contractVersion|metaData                    |
      * +---+---------------+----------------------------+
      * |k1 |v1             |[test1, 2020-06-23 14:39:55]|
      * +---+---------------+----------------------------+
      *
      * root
      * |-- key: string (nullable = false)
      * |-- contractVersion: string (nullable = false)
      * |-- metaData: struct (nullable = false)
      * |    |-- Test: string (nullable = false)
      * |    |-- DateUtc: timestamp (nullable = true)
      *
      * (key,StringType)
      * (contractVersion,StringType)
      * (metaData,StructType(StructField(Test,StringType,false), StructField(DateUtc,TimestampType,true)))
      */

Convert the timestamp inside the struct to a specific date format


    val isoDateFormatter = "yyyy-MM-dd'T'HH:mm:ss'Z'"
    // Rebuild the struct; the alias keeps the original field name DateUtc
    val processedDF = ExampleDataFrame.withColumn("metaData", struct($"metaData.Test",
      date_format($"metaData.DateUtc", isoDateFormatter).as("DateUtc")))
    processedDF.show(false)

    /**
      * +---+---------------+-----------------------------+
      * |key|contractVersion|metaData                     |
      * +---+---------------+-----------------------------+
      * |k1 |v1             |[test1, 2020-06-23T14:51:17Z]|
      * +---+---------------+-----------------------------+
      */

Update-1 (based on comments)

Extract the TimestampType fields from the StructType as separate columns

ExampleDataFrame.schema
      .filter(_.dataType.isInstanceOf[StructType])
      .flatMap(s => s.dataType.asInstanceOf[StructType]
        .filter(_.dataType == TimestampType)
        .map(f => s"${s.name}.${f.name}")
      )
      .foldLeft(ExampleDataFrame)((df, colName) => df.withColumn(colName, date_format(col(colName), isoDateFormatter)))
      .show(false)

    /**
      * +---+---------------+----------------------------+--------------------+
      * |key|contractVersion|metaData                    |metaData.DateUtc    |
      * +---+---------------+----------------------------+--------------------+
      * |k1 |v1             |[test1, 2020-06-23 21:40:36]|2020-06-23T21:40:36Z|
      * +---+---------------+----------------------------+--------------------+
      */
// Use backticks to select a column whose name contains a dot (.), e.g. df.select("`metaData.DateUtc`")
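
If you would rather keep the formatted value inside the struct than add a separate dotted column, the two approaches above can probably be combined: walk the schema to find the TimestampType fields, then rebuild each struct in place. The following is only a minimal sketch under some assumptions. `formatStructTimestamps` is a hypothetical helper name (it is not part of Spark or of the code above), it handles only one level of nesting, it assumes field names contain no dots, and a struct that was null in the input comes back as a struct whose fields are all null.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, date_format, struct}
import org.apache.spark.sql.types.{StructType, TimestampType}

// Hypothetical helper, not from the original answer: formats every
// TimestampType column, whether top-level or one level inside a struct,
// and keeps all column and field names unchanged.
def formatStructTimestamps(df: DataFrame, pattern: String): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    field.dataType match {
      // Struct column with at least one timestamp field: rebuild the struct
      case st: StructType if st.fields.exists(_.dataType == TimestampType) =>
        val rebuilt: Array[Column] = st.fields.map { f =>
          val c = col(s"${field.name}.${f.name}")
          if (f.dataType == TimestampType) date_format(c, pattern).as(f.name)
          else c.as(f.name)
        }
        acc.withColumn(field.name, struct(rebuilt: _*))
      // Top-level timestamp column: format it directly
      case TimestampType =>
        acc.withColumn(field.name, date_format(col(field.name), pattern))
      // Anything else is left untouched
      case _ => acc
    }
  }
```

With the test data above, `formatStructTimestamps(ExampleDataFrame, isoDateFormatter)` should leave `key` and `contractVersion` alone and turn `metaData.DateUtc` into a string field. Matching on `field.dataType` rather than on the strings returned by `.dtypes` is the design choice here, since `.dtypes` only reports one type name per top-level column.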
