
Pyspark SQL dataframe map with multiple data types

I have PySpark code in AWS Glue where I want to create a DataFrame with a map column that combines integer and string values.

sample data:

{
  "Candidates": [
    { "jobLevel": 6, "name": "Steven" },
    { "jobLevel": 5, "name": "Abby" }
  ]
}

Hence, I tried the code below to create the map data type, but the integer jobLevel always gets converted to a string. Any suggestions for doing this while retaining the job level's data type?

code used:

df = spark.sql("""
    select Supervisor_name,
           map('job_level', INT(job_level_name),
               'name', employeeLogin) as Candidates
    from dataset_1
""")

It is not possible for map values to have different types: Spark coerces them all to a common type. Use a struct for this situation.

df = spark.sql("""
    select Supervisor_name, 
           struct(INT(job_level_name) as job_level, 
                  employeeLogin as name
                 ) as Candidates 
    from dataset_1
""")

I am new to PySpark :-), but let's try parallelizing the data and then defining the schema we want:

    js = {"Candidates": [
        {"jobLevel": 6, "name": "Steven"},
        {"jobLevel": 5, "name": "Abby"},
    ]}



    import json
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # parallelize the candidates as JSON strings, then read them
    # back with an explicit schema so jobLevel stays an integer
    rdd = sc.parallelize([json.dumps(c) for c in js["Candidates"]])
    schema = StructType([StructField('name', StringType(), True),
                         StructField('jobLevel', IntegerType(), True)])
    df1 = spark.read.json(rdd, schema)
    df1.show(truncate=False)
    df1.printSchema()

I get:

+------+--------+
|name  |jobLevel|
+------+--------+
|Steven|6       |
|Abby  |5       |
+------+--------+

root
 |-- name: string (nullable = true)
 |-- jobLevel: integer (nullable = true)
