
Change schema of Spark dataframe column

I have a PySpark dataframe with a column "Student".

One entry of data is as follows:

{
   "Student" : {
       "m" : {
           "name" : {"s" : "john"},
           "score": {"s" : "165"}
       }
   }
}

I want to change the schema of this column, so that the entry looks as follows:

{
    "Student" : 
    {
        "m" : 
        {
            "StudentDetails" : 
            {
                "m" : 
                {
                    "name" : {"s" : "john"},
                    "score": {"s" : "165"}
                }
            }
        }
    } 
}

The problem is that the Student field can also be null in the dataframe, so I want to retain the null values but change the schema of the non-null values. I have used a udf for this, which works:

def Helper_ChangeSchema(row):
    # keep null entries as-is
    if row is None:
        return None
    # recursively convert the Row to a dict and wrap it one level deeper
    data = row.asDict(True)
    return {"m": {"StudentDetails": data}}

But a udf is a black box for Spark's optimizer. Is there any way to do the same using built-in Spark functions or SQL expressions?
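For reference, the per-row reshaping the udf performs can be checked in plain Python, without a Spark session. This is only a sketch of the target shape (the helper name `wrap_student` is made up for illustration):

```python
def wrap_student(student):
    """Mirror Helper_ChangeSchema on plain dicts, including the null check."""
    if student is None:
        return None
    return {"m": {"StudentDetails": student}}

old = {"m": {"name": {"s": "john"}, "score": {"s": "165"}}}
new = wrap_student(old)
assert new == {"m": {"StudentDetails": {"m": {"name": {"s": "john"},
                                              "score": {"s": "165"}}}}}
assert wrap_student(None) is None  # null entries stay null
```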

It works exactly like in this answer: just add another nested level in the struct.

Either as a SQL expression:

processedDf = df.withColumn("student", F.expr("named_struct('m', named_struct('student_details', student))"))

or in Python code using the struct function:

processedDf = df.withColumn("student", F.struct(F.struct(F.col("student").alias('student_details')).alias('m')))

Both versions have the same result:

root
 |-- student: struct (nullable = false)
 |    |-- m: struct (nullable = false)
 |    |    |-- student_details: struct (nullable = true)
 |    |    |    |-- m: struct (nullable = true)
 |    |    |    |    |-- name: struct (nullable = true)
 |    |    |    |    |    |-- s: string (nullable = true)
 |    |    |    |    |-- score: struct (nullable = true)
 |    |    |    |    |    |-- s: string (nullable = true)

Both approaches also work fine with null rows. Using this input data:

data ='{"student" : {"m" : {"name" : {"s" : "john"},"score": {"s" : "165"}}}}'
data2='{"student": null }'
df = spark.read.json(sc.parallelize([data, data2]))

processedDf.show(truncate=False) prints

+---------------------+
|student              |
+---------------------+
|[[[[[john], [165]]]]]|
|[[]]                 |
+---------------------+


EDIT: If the whole value should be set to null instead of a struct with null fields, you can add a when:

processedDf = df.withColumn("student", F.when(F.col("student").isNull(), F.lit(None)).otherwise(F.struct(F.struct(F.col("student")).alias('m'))))

This will result in the same schema, but a different output for the null row:

+---------------------+
|student              |
+---------------------+
|[[[[[john], [165]]]]]|
|null                 |
+---------------------+
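The difference on the null row can be mirrored in plain Python (a sketch with hypothetical helper names, no Spark needed):

```python
def wrap_always(student):
    # like F.struct(F.struct(col).alias('m')): the outer struct levels are
    # built even when the wrapped value is null -> shows up as [[]] above
    return {"m": {"student_details": student}}

def wrap_null_preserving(student):
    # like F.when(col.isNull(), None).otherwise(...): whole value stays null
    return None if student is None else {"m": {"student_details": student}}

assert wrap_always(None) == {"m": {"student_details": None}}  # the "[[]]" row
assert wrap_null_preserving(None) is None                     # the "null" row
```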
