
Create a new column in a dataset using a case class structure

Assuming the following schema for a table, Places:

root
|-- place_id: string (nullable = true)
|-- street_address: string (nullable = true)
|-- city: string (nullable = true)
|-- state_province: string (nullable = true)
|-- postal_code: string (nullable = true)
|-- country: string (nullable = true)
|-- neighborhood: string (nullable = true)

val places is of type Dataset[Row],

and I have the following case class:

case class csm(
    city: Option[String] = None,
    stateProvince: Option[String] = None,
    country: Option[String] = None
)

How would I go about altering, or creating, a new dataset that has the following schema:

root
|-- place_id: string (nullable = true)
|-- street_address: string (nullable = true)
|-- subpremise: string (nullable = true)
|-- city: string (nullable = true)
|-- state_province: string (nullable = true)
|-- postal_code: string (nullable = true)
|-- country: string (nullable = true)
|-- neighborhood: string (nullable = true)
|-- csm: struct (nullable = true)
|   |-- city: string (nullable = true)
|   |-- state_province: string (nullable = true)
|   |-- country: string (nullable = true)

I've been looking into withColumn methods, and they seem to require UDFs. The challenge is that I have to manually specify the columns, which is easy for this use case, but as my problem scales it will be difficult to maintain them by hand.

Used this as a reference: https://intellipaat.com/community/16433/how-to-add-a-new-struct-column-to-a-dataframe

In your case class declaration you have a stateProvince parameter, but in your dataframe there's a state_province column instead.

I'm not sure whether that's a typo, so first, a quick-and-dirty, not-thoroughly-tested camelCase to snake_case converter, just in case:

def normalize(x: String): String =
    "([a-z])([A-Z])".r.replaceAllIn(x, m => s"${m.group(1)}_${m.group(2).toLowerCase}")

Next, let's get the parameters of the case class:

val case_class_params = Seq[csm]().toDF.columns
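If pulling field names through an empty DataFrame feels heavy, the Product API can do the same without a SparkSession; a small sketch, assuming Scala 2.13+ (where productElementNames is available):

```scala
case class csm(
    city: Option[String] = None,
    stateProvince: Option[String] = None,
    country: Option[String] = None
)

object FieldNames extends App {
  // productElementNames yields constructor field names in declaration order (Scala 2.13+)
  val caseClassParams = csm().productElementNames.toList
  println(caseClassParams) // List(city, stateProvince, country)
}
```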

And with this, we can now get columns for our case class struct:

val csm_cols = case_class_params.map(x => col(normalize(x)))
val df2 = df.withColumn("csm", struct(csm_cols:_*))
+--------+--------------+---------+--------------+-----------+------------+------------+----------------------------------------+
|place_id|street_address|city     |state_province|postal_code|country     |neighborhood|csm                                     |
+--------+--------------+---------+--------------+-----------+------------+------------+----------------------------------------+
|123     |str_addr      |some_city|some_province |some_zip   |some_country|NA          |{some_city, some_province, some_country}|
+--------+--------------+---------+--------------+-----------+------------+------------+----------------------------------------+

root
 |-- place_id: string (nullable = true)
 |-- street_address: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state_province: string (nullable = true)
 |-- postal_code: string (nullable = true)
 |-- country: string (nullable = true)
 |-- neighborhood: string (nullable = true)
 |-- csm: struct (nullable = false)
 |    |-- city: string (nullable = true)
 |    |-- state_province: string (nullable = true)
 |    |-- country: string (nullable = true)
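Note that df2 above still lacks the subpremise column from the target schema. Assuming there is no source data for it yet, one way (an untested sketch) is to append it as a null string column:

```scala
import org.apache.spark.sql.functions.lit

// add a nullable column with no data yet; the cast gives it an explicit string type
val df3 = df2.withColumn("subpremise", lit(null).cast("string"))
```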
case class Source(
    place_id: Option[String],
    street_address: Option[String],
    city: Option[String],
    state_province: Option[String],
    postal_code: Option[String],
    country: Option[String],
    neighborhood: Option[String]
)
case class Csm(
    city: Option[String] = None,
    stateProvince: Option[String] = None,
    country: Option[String] = None
)

case class Result(
    place_id: Option[String],
    street_address: Option[String],
    subpremise: Option[String],
    city: Option[String],
    state_province: Option[String],
    postal_code: Option[String],
    country: Option[String],
    neighborhood: Option[String],
    csm: Csm
)

import spark.implicits._

val sourceDF = Seq(
  Source(
    Some("s-1-1"),
    Some("s-1-2"),
    Some("s-1-3"),
    Some("s-1-4"),
    Some("s-1-5"),
    Some("s-1-6"),
    Some("s-1-7")
  ),
  Source(
    Some("s-2-1"),
    Some("s-2-2"),
    Some("s-2-3"),
    Some("s-2-4"),
    Some("s-2-5"),
    Some("s-2-6"),
    Some("s-2-7")
  )
).toDF()

val resultDF = sourceDF
  .map(r => {
    Result(
      // Option(...) rather than Some(...), so a null column becomes None instead of Some(null)
      Option(r.getAs[String]("place_id")),
      Option(r.getAs[String]("street_address")),
      Some("set your value"),
      Option(r.getAs[String]("city")),
      Option(r.getAs[String]("state_province")),
      Option(r.getAs[String]("postal_code")),
      Option(r.getAs[String]("country")),
      Option(r.getAs[String]("neighborhood")),
      Csm(
        Option(r.getAs[String]("city")),
        Option(r.getAs[String]("state_province")),
        Option(r.getAs[String]("country"))
      )
    )
  })
  .toDF()
resultDF.printSchema()
//    root
//    |-- place_id: string (nullable = true)
//    |-- street_address: string (nullable = true)
//    |-- subpremise: string (nullable = true)
//    |-- city: string (nullable = true)
//    |-- state_province: string (nullable = true)
//    |-- postal_code: string (nullable = true)
//    |-- country: string (nullable = true)
//    |-- neighborhood: string (nullable = true)
//    |-- csm: struct (nullable = true)
//    |    |-- city: string (nullable = true)
//    |    |-- stateProvince: string (nullable = true)
//    |    |-- country: string (nullable = true)

resultDF.show(false)

//  +--------+--------------+--------------+-----+--------------+-----------+-------+------------+---------------------+
//  |place_id|street_address|subpremise    |city |state_province|postal_code|country|neighborhood|csm                  |
//  +--------+--------------+--------------+-----+--------------+-----------+-------+------------+---------------------+
//  |s-1-1   |s-1-2         |set your value|s-1-3|s-1-4         |s-1-5      |s-1-6  |s-1-7       |[s-1-3, s-1-4, s-1-6]|
//  |s-2-1   |s-2-2         |set your value|s-2-3|s-2-4         |s-2-5      |s-2-6  |s-2-7       |[s-2-3, s-2-4, s-2-6]|
//  +--------+--------------+--------------+-----+--------------+-----------+-------+------------+---------------------+
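As an alternative to string-keyed getAs lookups, the same mapping can be sketched over a typed Dataset, so a renamed field fails at compile time rather than at runtime. This assumes the Source, Result, and Csm case classes defined above and an in-scope spark.implicits._:

```scala
// hedged alternative: typed mapping over Dataset[Source], no getAs string keys
val resultDS = sourceDF.as[Source].map { s =>
  Result(
    s.place_id,
    s.street_address,
    Some("set your value"), // subpremise placeholder, as in the untyped version
    s.city,
    s.state_province,
    s.postal_code,
    s.country,
    s.neighborhood,
    Csm(s.city, s.state_province, s.country)
  )
}
```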
