I'm working in a Python 3 notebook in Azure Databricks with Spark 3.0.1.
I have the following DataFrame:
+---+-------+
|ID |Name   |
+---+-------+
|1  |John   |
|2  |Michael|
+---+-------+
It can be created with this code:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data2 = [(1, "John"),
         (2, "Michael")]
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
])
df1 = spark.createDataFrame(data=data2, schema=schema)
df1.show(truncate=False)
I am trying to transform it into an object that can be serialized to JSON, with a single property called Entities that is an array of the elements in the DataFrame, like this:
{
  "Entities": [
    {
      "ID": 1,
      "Name": "John"
    },
    {
      "ID": 2,
      "Name": "Michael"
    }
  ]
}
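For context, the target shape is just a dict with a single Entities key; in plain Python (with the rows hard-coded here as a stand-in for the DataFrame contents) it would be:

```python
import json

# Hard-coded stand-in for the DataFrame rows
rows = [{"ID": 1, "Name": "John"},
        {"ID": 2, "Name": "Michael"}]

print(json.dumps({"Entities": rows}, indent=2))
```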
I've been trying to figure out how to do it but haven't had any luck so far. Can anyone point me in the right direction please?
Try this: build a struct of the columns you want per entity, then use collect_list to gather them into a single array column:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import functions as F

data2 = [
    (1, "John", "Doe"),
    (2, "Michael", "Douglas"),
]
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("fname", StringType(), True),
    StructField("lname", StringType(), True),
])
df1 = spark.createDataFrame(data2, schema)
df = (
    df1
    .withColumn("profile", F.struct("id", "fname"))
    .groupby()
    .agg(F.collect_list("profile").alias("Entities"))
)
df.select("Entities").coalesce(1).write.format('json').save('test', mode="overwrite")
Output file (shown pretty-printed here):
{
  "Entities": [{
    "id": 1,
    "fname": "John"
  }, {
    "id": 2,
    "fname": "Michael"
  }]
}
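One thing to note: Spark's JSON writer emits JSON Lines, i.e. each record is written on a single line with no pretty-printing, so the saved part file will actually hold the whole object on one line. A small sketch of reading it back (the line is hard-coded here as a stand-in for the part-file contents):

```python
import json

# Stand-in for one line of the saved part file
# (Spark writes one JSON object per line)
line = '{"Entities":[{"id":1,"fname":"John"},{"id":2,"fname":"Michael"}]}'

record = json.loads(line)
print(len(record["Entities"]))  # prints 2
```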