
Change schema of spark dataframe column

I have a PySpark dataframe with a column "Student".

One entry of data is as follows:

{
   "Student" : {
       "m" : {
           "name" : {"s" : "john"},
           "score": {"s" : "165"}
       }
   }
}

I want to change the schema of this column, so that the entry looks as follows:

{
    "Student" : 
    {
        "m" : 
        {
            "StudentDetails" : 
            {
                "m" : 
                {
                    "name" : {"s" : "john"},
                    "score": {"s" : "165"}
                }
            }
        }
    } 
}

The problem is that the Student field can also be null in the dataframe. I want to retain the null values but change the schema of the non-null values. I have used a udf for this, which works:

def Helper_ChangeSchema(row):
    # preserve null Students as null
    if row is None:
        return None
    # recursively convert the Row to a dict and nest it one level deeper
    data = row.asDict(True)
    return {"m": {"StudentDetails": data}}
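The wrapping the udf performs is plain Python; as a minimal standalone sketch (no Spark session required, using a plain dict in place of a Row, with the illustrative helper name `wrap_student`), the transformation on one entry looks like this:

```python
def wrap_student(student):
    """Wrap a student's map one level deeper under 'StudentDetails'."""
    # preserve nulls: a missing Student stays None
    if student is None:
        return None
    return {"m": {"StudentDetails": student}}

entry = {"name": {"s": "john"}, "score": {"s": "165"}}
wrapped = wrap_student(entry)
# wrapped == {"m": {"StudentDetails": {"name": {"s": "john"}, "score": {"s": "165"}}}}
```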

However, a udf is a black box to Spark's optimizer. Is there any way to do the same thing using built-in Spark functions or SQL queries?

This works exactly like in this answer. Just add another nesting level in the struct:

Either as a SQL expression:

processedDf = df.withColumn("student", F.expr("named_struct('m', named_struct('student_details', student))"))

or in Python code using the struct function:

processedDf = df.withColumn("student", F.struct(F.struct(F.col("student")).alias('m')))

Both versions have the same result:

root
 |-- student: struct (nullable = false)
 |    |-- m: struct (nullable = false)
 |    |    |-- student_details: struct (nullable = true)
 |    |    |    |-- m: struct (nullable = true)
 |    |    |    |    |-- name: struct (nullable = true)
 |    |    |    |    |    |-- s: string (nullable = true)
 |    |    |    |    |-- score: struct (nullable = true)
 |    |    |    |    |    |-- s: string (nullable = true)

Both approaches also work fine with empty rows. Using this input data:

data ='{"student" : {"m" : {"name" : {"s" : "john"},"score": {"s" : "165"}}}}'
data2='{"student": null }'
df = spark.read.json(sc.parallelize([data, data2]))
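As a quick sanity check on the sample strings themselves (a plain-`json` sketch, no Spark needed), the first row parses to a populated map and the second to a null student:

```python
import json

data  = '{"student" : {"m" : {"name" : {"s" : "john"},"score": {"s" : "165"}}}}'
data2 = '{"student": null }'

rows = [json.loads(d) for d in (data, data2)]
# rows[0]["student"]["m"]["name"]["s"] == "john"; rows[1]["student"] is None
```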

processedDf.show(truncate=False) prints

+---------------------+
|student              |
+---------------------+
|[[[[[john], [165]]]]]|
|[[]]                 |
+---------------------+


EDIT: if the whole row should be set to null instead of the fields of the struct, you can add a when:

processedDf = df.withColumn("student", F.when(F.col("student").isNull(), F.lit(None)).otherwise(F.struct(F.struct(F.col("student")).alias('m'))))

This results in the same schema, but a different output for the null row:

+---------------------+
|student              |
+---------------------+
|[[[[[john], [165]]]]]|
|null                 |
+---------------------+

