
Cast values of a Spark dataframe using a defined StructType

Is there a way to cast all the values of a dataframe using a StructType?

Let me explain my question using an example:

Let's say we obtained a dataframe after reading from a file (I am providing code that generates this dataframe, but in my real-world project I obtain it by reading from a file):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._
    val rows1 = Seq(
      Row("1", Row("a", "b"), "8.00", Row("1","2")),
      Row("2", Row("c", "d"), "9.00", Row("3","4"))
    )

    val rows1Rdd = spark.sparkContext.parallelize(rows1, 4)

    val schema1 = StructType(
      Seq(
        StructField("id", StringType, true),
        StructField("s1", StructType(
          Seq(
            StructField("x", StringType, true),
            StructField("y", StringType, true)
          )
        ), true),
        StructField("d", StringType, true),
        StructField("s2", StructType(
          Seq(
            StructField("u", StringType, true),
            StructField("v", StringType, true)
          )
        ), true)
      )
    )

    val df1 = spark.createDataFrame(rows1Rdd, schema1)

    println("Schema with nested struct")
    df1.printSchema()

    root
     |-- id: string (nullable = true)
     |-- s1: struct (nullable = true)
     |    |-- x: string (nullable = true)
     |    |-- y: string (nullable = true)
     |-- d: string (nullable = true)
     |-- s2: struct (nullable = true)
     |    |-- u: string (nullable = true)
     |    |-- v: string (nullable = true)

Now let's say that my client provided the schema they want for the data (equivalent to the schema of the dataframe we read, but with different data types: StringType, IntegerType, ...):

    val wantedSchema = StructType(
      Seq(
        StructField("id", IntegerType, true),
        StructField("s1", StructType(
          Seq(
            StructField("x", StringType, true),
            StructField("y", StringType, true)
          )
        ), true),
        StructField("d", DoubleType, true),
        StructField("s2", StructType(
          Seq(
            StructField("u", IntegerType, true),
            StructField("v", IntegerType, true)
          )
        ), true)
      )
    )

What's the best way to cast the dataframe's values using the provided StructType?

It would be great if there were a method we could apply to a dataframe that casts all the values to the new StructType by itself.

PS: This is a small dataframe used as an example; in my project the dataframe contains many more rows. If it were a small dataframe with few columns, I could have done the cast easily, but in my case I am looking for a generic solution that casts all the values by applying a StructType, without having to cast each column/value manually in the code.

I will be grateful for any help you can provide. Thanks a lot!

There's no automatic way to perform the conversion. You can express the conversion logic in Spark SQL to convert everything in one pass; the resulting SQL might get quite big, though, if you have a lot of fields. But at least you get to keep all your transformations in one place.

Example:

    df1.selectExpr("CAST (id AS INTEGER) AS id",
      "STRUCT (s1.x, s1.y) AS s1",
      "CAST (d AS DOUBLE) AS d",
      "STRUCT (CAST (s2.u AS INTEGER) AS u, CAST (s2.v AS INTEGER) AS v) AS s2").show()

One thing to watch out for is that whenever a conversion fails (e.g., when d is not a number), you'll get a NULL. One option is to run some validation prior to the conversion, and then filter the df1 records to convert only the valid ones.
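A minimal sketch of that validation idea, relying on the fact that a failed cast yields null (the column names assume the df1 from the question; note that a genuinely null d would also land in the invalid set, so adjust if that matters):

    import org.apache.spark.sql.functions.col

    // Flag rows whose "d" casts cleanly to double, then split the dataframe.
    val withFlag = df1.withColumn("d_ok", col("d").cast("double").isNotNull)

    val validRows   = withFlag.filter(col("d_ok")).drop("d_ok")
    val invalidRows = withFlag.filter(!col("d_ok")).drop("d_ok")

You can then run the cast only on validRows and route invalidRows to a quarantine table or log.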

After a lot of research, here's a generic solution for casting a dataframe to match a given schema:

    val castedDf = df1.selectExpr(wantedSchema.map(
      field => s"CAST (${field.name} AS ${field.dataType.sql}) ${field.name}"
    ): _*)

Here's the schema of the cast dataframe:

    castedDf.printSchema
    root
     |-- id: integer (nullable = true)
     |-- s1: struct (nullable = true)
     |    |-- x: string (nullable = true)
     |    |-- y: string (nullable = true)
     |-- d: double (nullable = true)
     |-- s2: struct (nullable = true)
     |    |-- u: integer (nullable = true)
     |    |-- v: integer (nullable = true)
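For what it's worth, the reason a single CAST per top-level field also converts nested columns is that Spark can render a StructType back into a SQL type string. A small illustrative sketch (the exact rendered string is an assumption about Spark's DataType.sql output and may vary by version):

    // Illustrative only: print the CAST expressions the generic solution builds.
    // For the nested s2 field, StructType.sql renders something like
    // "STRUCT<`u`: INT, `v`: INT>", so Spark casts each nested leaf in one go.
    val castExprs = wantedSchema.map(f => s"CAST (${f.name} AS ${f.dataType.sql}) ${f.name}")
    castExprs.foreach(println)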

I hope it helps someone; I spent 5 days looking for this simple, generic solution.
