Create a new column in a dataset using a case class structure
Assume the following schema for a table, Places:
root
|-- place_id: string (nullable = true)
|-- street_address: string (nullable = true)
|-- city: string (nullable = true)
|-- state_province: string (nullable = true)
|-- postal_code: string (nullable = true)
|-- country: string (nullable = true)
|-- neighborhood: string (nullable = true)
val places is of type Dataset[Row].
I have the following case class:
case class csm(
city: Option[String] = None,
stateProvince: Option[String] = None,
country: Option[String] = None
)
How would I go about altering the dataset, or creating a new one, so that it has the following schema?
root
|-- place_id: string (nullable = true)
|-- street_address: string (nullable = true)
|-- subpremise: string (nullable = true)
|-- city: string (nullable = true)
|-- state_province: string (nullable = true)
|-- postal_code: string (nullable = true)
|-- country: string (nullable = true)
|-- neighborhood: string (nullable = true)
|-- csm: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state_province: string (nullable = true)
| |-- country: string (nullable = true)
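I have been looking into the withColumn method, and it seems to require a UDF. The challenge is that I have to specify the columns manually. That is easy for this use case, but as my problem scales, maintaining them by hand will become difficult.
Using this as a reference: https://intellipaat.com/community/16433/how-to-add-a-new-struct-column-to-a-dataframe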
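In your case class declaration you have a stateProvince parameter, but your dataframe has a state_province column.
I am not sure whether that is a typo, so first, just in case, here is a quick-n-dirty, not thoroughly tested camelCase to snake_case converter: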
def normalize(x: String): String =
"([a-z])([A-Z])".r replaceAllIn(x, m => s"${m.group(1)}_${m.group(2).toLowerCase}")
Next we get the parameters of the case class (this relies on import spark.implicits._ being in scope):
val case_class_params = Seq[csm]().toDF.columns
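Since the converter is described as not thoroughly tested, it may help to check it against the field names it will actually see. The sketch below is plain Scala with no Spark dependency; `NormalizeDemo`, `toSnakeCase`, and the sample name `postalCodeID` are assumptions added for illustration and are not part of the original answer.

```scala
object NormalizeDemo {
  // Converter from the answer: inserts "_" at each lower-to-upper boundary
  // and lowercases only the capital that follows the boundary.
  def normalize(x: String): String =
    "([a-z])([A-Z])".r.replaceAllIn(x, m => s"${m.group(1)}_${m.group(2).toLowerCase}")

  // Hypothetical variant (not from the answer): also treats digits as a
  // boundary start and lowercases the whole result afterwards.
  def toSnakeCase(x: String): String =
    "([a-z0-9])([A-Z])".r.replaceAllIn(x, m => s"${m.group(1)}_${m.group(2)}").toLowerCase

  def main(args: Array[String]): Unit = {
    val fields = Seq("city", "stateProvince", "country", "postalCodeID")
    // Print both conversions side by side for comparison.
    fields.foreach { f =>
      println(s"$f -> ${normalize(f)} / ${toSnakeCase(f)}")
    }
  }
}
```

For the three fields of the case class both functions agree (`stateProvince` becomes `state_province`). They differ on names that end in consecutive capitals: `normalize("postalCodeID")` yields `postal_code_iD`, while the variant yields `postal_code_id`.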
With the parameter names in hand, we can now build the columns for the case class struct:
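import org.apache.spark.sql.functions.{col, struct}
// df is the Places DataFrame; each case class field maps to its snake_case column
val csm_cols = case_class_params.map(x => col(normalize(x)))
val df2 = df.withColumn("csm", struct(csm_cols:_*))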
+--------+--------------+---------+--------------+-----------+------------+------------+----------------------------------------+
|place_id|street_address|city |state_province|postal_code|country |neighborhood|csm |
+--------+--------------+---------+--------------+-----------+------------+------------+----------------------------------------+
|123 |str_addr |some_city|some_province |some_zip |some_country|NA |{some_city, some_province, some_country}|
+--------+--------------+---------+--------------+-----------+------------+------------+----------------------------------------+
root
|-- place_id: string (nullable = true)
|-- street_address: string (nullable = true)
|-- city: string (nullable = true)
|-- state_province: string (nullable = true)
|-- postal_code: string (nullable = true)
|-- country: string (nullable = true)
|-- neighborhood: string (nullable = true)
|-- csm: struct (nullable = false)
| |-- city: string (nullable = true)
| |-- state_province: string (nullable = true)
| |-- country: string (nullable = true)
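The steps above can be put together as one self-contained sketch. It assumes a local SparkSession and a one-row sample DataFrame. It also reads the field names through `Encoders.product[Csm].schema.fieldNames` rather than through an empty `Seq[csm]().toDF`, which is a swapped-in way of getting the same names. The names `AddCsmStruct`, `withCsm`, and the capitalised `Csm` are illustrative, and the `require` check is an addition that fails early when a case class field has no matching column.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, struct}

// Same fields as the case class in the question (capitalised here by convention).
case class Csm(
  city: Option[String] = None,
  stateProvince: Option[String] = None,
  country: Option[String] = None
)

object AddCsmStruct {
  // camelCase -> snake_case converter from the answer.
  def normalize(x: String): String =
    "([a-z])([A-Z])".r.replaceAllIn(x, m => s"${m.group(1)}_${m.group(2).toLowerCase}")

  // Adds a "csm" struct column built from the columns named after Csm's fields.
  def withCsm(places: DataFrame): DataFrame = {
    // Field names come from the case class, so nothing is listed by hand.
    val columnNames = Encoders.product[Csm].schema.fieldNames.map(normalize)
    val existing = places.columns.toSet
    val missing = columnNames.filterNot(existing)
    require(missing.isEmpty, s"Columns not found: ${missing.mkString(", ")}")
    places.withColumn("csm", struct(columnNames.map(col): _*))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, for illustration only.
    val spark = SparkSession.builder().master("local[1]").appName("csm-sketch").getOrCreate()
    import spark.implicits._

    val places = Seq(
      ("123", "str_addr", "some_city", "some_province", "some_zip", "some_country", "NA")
    ).toDF("place_id", "street_address", "city", "state_province",
      "postal_code", "country", "neighborhood")

    val result = withCsm(places)
    result.printSchema()
    result.show(false)
    spark.stop()
  }
}
```

Because the struct is built from existing columns, no UDF is involved, and adding a field to the case class only requires that a column with the corresponding snake_case name exists.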
case class Source(
place_id: Option[String],
street_address: Option[String],
city: Option[String],
state_province: Option[String],
postal_code: Option[String],
country: Option[String],
neighborhood: Option[String]
)
case class Csm(
city: Option[String] = None,
stateProvince: Option[String] = None,
country: Option[String] = None
)
case class Result(
place_id: Option[String],
street_address: Option[String],
subpremise: Option[String],
city: Option[String],
state_province: Option[String],
postal_code: Option[String],
country: Option[String],
neighborhood: Option[String],
csm: Csm
)
import spark.implicits._
val sourceDF = Seq(
Source(
Some("s-1-1"),
Some("s-1-2"),
Some("s-1-3"),
Some("s-1-4"),
Some("s-1-5"),
Some("s-1-6"),
Some("s-1-7")
),
Source(
Some("s-2-1"),
Some("s-2-2"),
Some("s-2-3"),
Some("s-2-4"),
Some("s-2-5"),
Some("s-2-6"),
Some("s-2-7")
)
).toDF()
val resultDF = sourceDF
  .map(r => {
    // Option(...) turns a null column value into None; Some(null) would not
    Result(
      Option(r.getAs[String]("place_id")),
      Option(r.getAs[String]("street_address")),
      Some("set your value"),
      Option(r.getAs[String]("city")),
      Option(r.getAs[String]("state_province")),
      Option(r.getAs[String]("postal_code")),
      Option(r.getAs[String]("country")),
      Option(r.getAs[String]("neighborhood")),
      Csm(
        Option(r.getAs[String]("city")),
        Option(r.getAs[String]("state_province")),
        Option(r.getAs[String]("country"))
      )
    )
  })
  .toDF()
resultDF.printSchema()
// root
// |-- place_id: string (nullable = true)
// |-- street_address: string (nullable = true)
// |-- subpremise: string (nullable = true)
// |-- city: string (nullable = true)
// |-- state_province: string (nullable = true)
// |-- postal_code: string (nullable = true)
// |-- country: string (nullable = true)
// |-- neighborhood: string (nullable = true)
// |-- csm: struct (nullable = true)
// | |-- city: string (nullable = true)
// | |-- stateProvince: string (nullable = true)
// | |-- country: string (nullable = true)
resultDF.show(false)
// +--------+--------------+--------------+-----+--------------+-----------+-------+------------+---------------------+
// |place_id|street_address|subpremise |city |state_province|postal_code|country|neighborhood|csm |
// +--------+--------------+--------------+-----+--------------+-----------+-------+------------+---------------------+
// |s-1-1 |s-1-2 |set your value|s-1-3|s-1-4 |s-1-5 |s-1-6 |s-1-7 |[s-1-3, s-1-4, s-1-6]|
// |s-2-1 |s-2-2 |set your value|s-2-3|s-2-4 |s-2-5 |s-2-6 |s-2-7 |[s-2-3, s-2-4, s-2-6]|
// +--------+--------------+--------------+-----+--------------+-----------+-------+------------+---------------------+
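Note that the printed schema above has `csm.stateProvince`, whereas the schema requested in the question has `csm.state_province`. A minimal typed sketch of the same mapping follows, assuming the nested case class can simply use snake_case field names. `CsmSnake`, `ResultSnake`, and `TypedCsm` are hypothetical names introduced here, and `subpremise` is left as `None` because the source has no such column.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Source(
  place_id: Option[String], street_address: Option[String], city: Option[String],
  state_province: Option[String], postal_code: Option[String],
  country: Option[String], neighborhood: Option[String]
)

// Hypothetical variant of Csm whose field names already match the target schema.
case class CsmSnake(
  city: Option[String] = None,
  state_province: Option[String] = None,
  country: Option[String] = None
)

case class ResultSnake(
  place_id: Option[String], street_address: Option[String], subpremise: Option[String],
  city: Option[String], state_province: Option[String], postal_code: Option[String],
  country: Option[String], neighborhood: Option[String], csm: CsmSnake
)

object TypedCsm {
  // Works on Dataset[Source], so fields are accessed by name and checked at
  // compile time instead of being looked up with getAs on a Row.
  def addCsm(spark: SparkSession, source: Dataset[Source]): Dataset[ResultSnake] = {
    import spark.implicits._
    source.map { s =>
      ResultSnake(
        s.place_id, s.street_address, None, s.city, s.state_province,
        s.postal_code, s.country, s.neighborhood,
        CsmSnake(s.city, s.state_province, s.country)
      )
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, for illustration only.
    val spark = SparkSession.builder().master("local[1]").appName("typed-csm").getOrCreate()
    import spark.implicits._

    val source = Seq(
      Source(Some("s-1-1"), Some("s-1-2"), None, Some("s-1-4"),
        Some("s-1-5"), Some("s-1-6"), Some("s-1-7"))
    ).toDS()

    addCsm(spark, source).printSchema()
    spark.stop()
  }
}
```

An untyped DataFrame with matching column names can be converted first with `.as[Source]`. The trade-off compared with the struct-based approach is that every field still has to be listed in the case classes and in the mapping.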