I have a DataFrame with multiple list (array) columns, and I want to combine them into a single JSON array column.
I used the logic below, but it is not working. Any idea why?
def test(test1,test2):
    d = {'data': [{'marks': a, 'grades': t} for a, t in zip(test1, test2)]}
    return d
I defined the UDF with an array return type as below and tried to invoke it with withColumn, but it does not work. Any idea?
arrayToMapUDF = udf(test ,ArrayType(StringType()))
df.withColumn("jsonarraycolumn", arrayToMapUDF(col("col"),col("col2")))
marks | grades |
---|---|
[100, 150, 200, 300, 400] | [0.01, 0.02, 0.03, 0.04, 0.05] |
It needs to be converted as below:
marks | grades | Json-array-column |
---|---|---|
[100, 150, 200, 300, 400] | [0.01, 0.02, 0.03, 0.04, 0.05] | {attribute:[{marks: 1000, grades: 0.01}, {marks: 15000, grades: 0.02}, {marks: 2000, grades: 0.03}]} |
Declare the UDF's return type as StringType, because the function returns a single JSON string, not an array of strings. Inside the function, use json.dumps to convert the list of dictionaries to a JSON string.
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import json
def test(test1,test2):
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return json.dumps(d)
arrayToMapUDF = F.udf(test, StringType())
df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))
df2.show(truncate=False)
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|amount |discount |jsonarraycolumn |
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[{"amount": 1000, "discount": 0.01}, {"amount": 15000, "discount": 0.02}, {"amount": 2000, "discount": 0.03}, {"amount": 3000, "discount": 0.04}, {"amount": 4000, "discount": 0.05}]|
+-------------------------------+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
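The body of the UDF above is plain Python, so it can be checked without a Spark session. The following is a minimal sketch under that assumption; the name `to_json_array` and the two-element sample lists are illustrative and not part of the original answer.

```python
import json

def to_json_array(amounts, discounts):
    # Pair the two lists element-wise and serialize all pairs as ONE JSON string.
    pairs = [{"amount": a, "discount": d} for a, d in zip(amounts, discounts)]
    return json.dumps(pairs)

result = to_json_array([1000, 15000], [0.01, 0.02])

# The return value is a str, which is why StringType is the matching return type.
print(type(result).__name__)
print(result)

# The string can be parsed back into a list of dictionaries.
parsed = json.loads(result)
print(parsed[1]["amount"])
```

Because the function returns one string per row rather than a list of strings, declaring `ArrayType(StringType())` does not describe what it returns.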
If you don't want the quotes, strip them after serializing (note that the result is then no longer valid JSON):
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import json
def test(test1,test2):
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return json.dumps(d).replace('"', '')
arrayToMapUDF = F.udf(test, StringType())
df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))
df2.show(truncate=False)
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|amount |discount |jsonarraycolumn |
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[{amount: 1000, discount: 0.01}, {amount: 15000, discount: 0.02}, {amount: 2000, discount: 0.03}, {amount: 3000, discount: 0.04}, {amount: 4000, discount: 0.05}]|
+-------------------------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
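One caveat with removing the quotes: JSON requires double-quoted keys, so the stripped string matches the requested layout but a JSON parser will reject it. A small sketch in plain Python illustrates this; the helper `is_valid_json` is a hypothetical name introduced only for this example.

```python
import json

def is_valid_json(text):
    # Return True if the text parses as JSON, False otherwise.
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

valid = json.dumps([{"amount": 1000, "discount": 0.01}])
stripped = valid.replace('"', '')

print(stripped)                 # keys are no longer quoted
print(is_valid_json(valid))     # the original string parses
print(is_valid_json(stripped))  # the stripped string does not
```

If downstream code needs to parse the column again, keeping the quoted version is probably the safer choice.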
If you want a structured column (an array of structs) instead of a JSON string:
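import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType
def test(test1,test2):
    # Return the list of dictionaries itself; Spark maps each one to a struct.
    d = [{'amount': a, 'discount': t} for a, t in zip(test1, test2)]
    return d
arrayToMapUDF = F.udf(test,
    ArrayType(
        StructType([
            StructField('amount', StringType()),
            StructField('discount', StringType())
        ])
    )
)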
df2 = df.withColumn("jsonarraycolumn", arrayToMapUDF(F.col("amount"), F.col("discount")))
df2.show(truncate=False)
+-------------------------------+------------------------------+-----------------------------------------------------------------------+
|amount |discount |jsonarraycolumn |
+-------------------------------+------------------------------+-----------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[[1000, 0.01], [15000, 0.02], [2000, 0.03], [3000, 0.04], [4000, 0.05]]|
+-------------------------------+------------------------------+-----------------------------------------------------------------------+
df2.printSchema()
root
 |-- amount: array (nullable = false)
 |    |-- element: integer (containsNull = false)
 |-- discount: array (nullable = false)
 |    |-- element: double (containsNull = false)
 |-- jsonarraycolumn: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- amount: string (nullable = true)
 |    |    |-- discount: string (nullable = true)
To avoid UDFs altogether, you can use Spark's built-in higher-order functions:
import pyspark.sql.functions as f
transform_expr = "TRANSFORM(arrays_zip(amount, discount), value -> value)"
df = df.withColumn('jsonarraycolumn', f.to_json(f.expr(transform_expr)))
df.show(truncate=False)
Output:
+-------------------------------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|amount |discount |jsonarraycolumn |
+-------------------------------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1000, 15000, 2000, 3000, 4000]|[0.01, 0.02, 0.03, 0.04, 0.05]|[{"amount":1000.0,"discount":0.01},{"amount":15000.0,"discount":0.02},{"amount":2000.0,"discount":0.03},{"amount":3000.0,"discount":0.04},{"amount":4000.0,"discount":0.05}]|
+-------------------------------+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+