Scala Spark Structured Streaming: Filter by TimestampType within Struct Field
I have multiple data types in my defined schema. I'm trying to find a good way to filter by TimestampType, so that I can convert all TimestampType fields from long to a formatted datetime string. I'm able to filter the stream on StringType using .dtypes, but I run into problems when trying to filter StructFields and StructTypes with .dtypes. Is there a way to filter only the timestamp types within a struct? Below is the pseudo code I'm using, with Spark Structured Streaming on Scala 2.11.
val isoDateFormatter = "yyyy-MM-dd'T'HH:mm:ss'Z'"
val ExampleDataFrameLoad = spark
.readStream
.format("kafka")
.option("subscribe", topics.keys.mkString(","))
.options(kafkaConfig)
.load()
.select($"key".cast(StringType), $"value".cast(StringType), $"topic")
// Convert untyped dataframe to dataset
.as[(String, String, String)]
// Merge all manifests for vehicle in minibatch
.groupByKey(_._1)
//Start of merge
.flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout)(mergeGroup)
// .select($"key".cast(StringType),from_json($"value",schema).as("manifest"))
.select($"_1".alias("key"), $"_2".alias("jsonvalues"))
.select("key", "jsonvalues.*")
val ExampleDataFrame = ExampleDataFrameLoad
ExampleDataFrame.dtypes.foreach(println)
/* Returns
(key,StringType)
(contractVersion,StringType)
(metaData,StructType(StructField(Test,StringType,true), StructField(DateUtc,TimestampType,true)))
*/
// Uses the following objects:
import java.sql.Timestamp
object ManifestClasses {
  final case class ProductManifestDocument(
    contractVersion: Option[String],
    metaData: DocumentMetaData
  )
  final case class DocumentMetaData(
    Test: Option[String],
    DateUtc: Timestamp
  )
}
ExampleDataFrame
//brings back data fields with types
.dtypes
//Currently returning empty but works for StringType
.filter(_._2 == "TimestampType")
.map(_._1)
//Transforms all timestamp longs to yyyy-MM-dd'T'HH:mm:ss'Z' format
.foldLeft(ExampleDataFrame)((df, colName) => df.withColumn(colName, date_format(col(colName), isoDateFormatter)))
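A likely reason the filter above comes back empty: `.dtypes` stringifies each column's type, so a nested TimestampType only ever appears *inside* the struct's type string (e.g. `StructType(StructField(DateUtc,TimestampType,true))`), never as the exact string `"TimestampType"`. A minimal sketch (assuming the schema printed above) that shows what the exact-match filter actually sees versus a substring check:

```scala
// Sketch: dtypes renders a nested timestamp inside one big StructType string,
// so an exact match on "TimestampType" only finds top-level timestamp columns.
val exactMatches = ExampleDataFrame.dtypes
  .filter { case (_, typeString) => typeString == "TimestampType" } // empty here

val structsContainingTimestamps = ExampleDataFrame.dtypes
  .filter { case (_, typeString) => typeString.contains("TimestampType") }
  .map(_._1) // yields the struct column name (e.g. "metaData"), not the nested field
```

The substring check only tells you *which struct column* contains a timestamp; to get at the nested field itself you need to walk `df.schema`, as the answer below does.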
You can transform the StructType as follows.
Change the date format of the TimestampType in place:
val ExampleDataFrame = spark.sql("select key, contractVersion, metaData from values " +
"('k1', 'v1', named_struct('Test', 'test1', 'DateUtc', cast(unix_timestamp() as timestamp))) " +
"T(key, contractVersion, metaData)")
ExampleDataFrame.show(false)
ExampleDataFrame.printSchema()
ExampleDataFrame.dtypes.foreach(println)
/**
* +---+---------------+----------------------------+
* |key|contractVersion|metaData |
* +---+---------------+----------------------------+
* |k1 |v1 |[test1, 2020-06-23 14:39:55]|
* +---+---------------+----------------------------+
*
* root
* |-- key: string (nullable = false)
* |-- contractVersion: string (nullable = false)
* |-- metaData: struct (nullable = false)
* | |-- Test: string (nullable = false)
* | |-- DateUtc: timestamp (nullable = true)
*
* (key,StringType)
* (contractVersion,StringType)
* (metaData,StructType(StructField(Test,StringType,false), StructField(DateUtc,TimestampType,true)))
*/
val isoDateFormatter = "yyyy-MM-dd'T'HH:mm:ss'Z'"
val processedDF = ExampleDataFrame.withColumn("metaData", struct(
  $"metaData.Test".as("Test"),
  // note: date_format returns a StringType column holding the formatted value
  date_format($"metaData.DateUtc", isoDateFormatter).as("DateUtc")))
processedDF.show(false)
/**
* +---+---------------+-----------------------------+
* |key|contractVersion|metaData |
* +---+---------------+-----------------------------+
* |k1 |v1 |[test1, 2020-06-23T14:51:17Z]|
* +---+---------------+-----------------------------+
*/
Alternatively, extract the TimestampType fields as separate top-level columns:
ExampleDataFrame.schema
.filter(_.dataType.isInstanceOf[StructType])
.flatMap(s => s.dataType.asInstanceOf[StructType]
.filter(_.dataType == TimestampType)
.map(f => s"${s.name}.${f.name}")
)
.foldLeft(ExampleDataFrame)((df, colName) => df.withColumn(colName, date_format(col(colName), isoDateFormatter)))
.show(false)
/**
* +---+---------------+----------------------------+--------------------+
* |key|contractVersion|metaData |metaData.DateUtc |
* +---+---------------+----------------------------+--------------------+
* |k1 |v1 |[test1, 2020-06-23 21:40:36]|2020-06-23T21:40:36Z|
* +---+---------------+----------------------------+--------------------+
*/
// use df.select("`metaData.DateUtc`") to select the columns having dot(.)
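The schema walk above only goes one level deep (structs directly under the root). If your schema can nest structs further, a recursive traversal collects every TimestampType path. This is a sketch under the assumption that nesting is via structs only (arrays and maps of structs are not handled); `timestampPaths` is a hypothetical helper name:

```scala
import org.apache.spark.sql.types.{StructType, TimestampType}

// Recursively collect dot-separated paths of every TimestampType field.
// Sketch only: handles nested structs, not arrays or maps of structs.
def timestampPaths(schema: StructType, prefix: String = ""): Seq[String] =
  schema.fields.toSeq.flatMap { field =>
    val path = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
    field.dataType match {
      case nested: StructType => timestampPaths(nested, path)
      case TimestampType      => Seq(path)
      case _                  => Seq.empty
    }
  }

// Usage, reusing the foldLeft pattern from above -- each formatted value lands
// in a new top-level column whose name contains the dot:
// timestampPaths(ExampleDataFrame.schema)
//   .foldLeft(ExampleDataFrame)((df, c) =>
//     df.withColumn(c, date_format(col(c), isoDateFormatter)))
```

As with the answer's version, `withColumn("metaData.DateUtc", ...)` creates a flat column whose name happens to contain a dot; rebuild the struct with `struct(...)` if you want the formatted value back inside `metaData`.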