I'm working in a Python 3 notebook in Azure Databricks with Spark 3.0.1.
I have the following DataFrame:
+---+-------+
|ID |Name   |
+---+-------+
|1  |John   |
|2  |Michael|
+---+-------+
It can be created with this code:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data2 = [(1, "John"),
         (2, "Michael")]
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
])
df1 = spark.createDataFrame(data=data2, schema=schema)
df1.show(truncate=False)
I am trying to transform it into an object that can be serialized to JSON, with a single property called Entities that is an array of the elements in the DataFrame, like this:
{
  "Entities": [
    {
      "ID": 1,
      "Name": "John"
    },
    {
      "ID": 2,
      "Name": "Michael"
    }
  ]
}
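For context, the target shape is just a dict with a single Entities key; in plain Python (with the rows hard-coded here as a stand-in for the DataFrame contents) it would be:

```python
import json

# Hard-coded stand-in for the DataFrame rows
rows = [{"ID": 1, "Name": "John"},
        {"ID": 2, "Name": "Michael"}]

print(json.dumps({"Entities": rows}, indent=2))
```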
I've been trying to figure out how to do it but haven't had any luck so far. Can anyone point me in the right direction please?
Try this: build a struct of the columns you want per entity, then use collect_list to gather them into a single array column:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import functions as F

data2 = [
    (1, "John", "Doe"),
    (2, "Michael", "Douglas"),
]
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("fname", StringType(), True),
    StructField("lname", StringType(), True),
])
df1 = spark.createDataFrame(data2, schema)
df = (
    df1
    .withColumn("profile", F.struct("id", "fname"))
    .groupby()
    .agg(F.collect_list("profile").alias("Entities"))
)
df.select("Entities").coalesce(1).write.format('json').save('test', mode="overwrite")
Output file (shown pretty-printed here):
{
  "Entities": [{
    "id": 1,
    "fname": "John"
  }, {
    "id": 2,
    "fname": "Michael"
  }]
}
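One thing to note: Spark's JSON writer emits JSON Lines, i.e. each record is written on a single line with no pretty-printing, so the saved part file will actually hold the whole object on one line. A small sketch of reading it back (the line is hard-coded here as a stand-in for the part-file contents):

```python
import json

# Stand-in for one line of the saved part file
# (Spark writes one JSON object per line)
line = '{"Entities":[{"id":1,"fname":"John"},{"id":2,"fname":"Michael"}]}'

record = json.loads(line)
print(len(record["Entities"]))  # prints 2
```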