Here is a sample data
val df4 = sc.parallelize(List(
("A1",45, "5", 1, 90),
("A2",60, "1", 1, 120),
("A6", 30, "9", 1, 450),
("A7", 89, "7", 1, 333),
("A7", 89, "4", 1, 320),
("A2",60, "5", 1, 22),
("A1",45, "22", 1, 1)
)).toDF("CID","age", "children", "marketplace_id","value")
thanks to @Shu for this piece of code
val df5 = df4.selectExpr("CID","""to_json(named_struct("id", children)) as item""", "value", "marketplace_id")
+---+-----------+-----+--------------+
|CID|item |value|marketplace_id|
+---+-----------+-----+--------------+
|A1 |{"id":"5"} |90 |1 |
|A2 |{"id":"1"} |120 |1 |
|A6 |{"id":"9"} |450 |1 |
|A7 |{"id":"7"} |333 |1 |
|A7 |{"id":"4"} |320 |1 |
|A2 |{"id":"5"} |22 |1 |
|A1 |{"id":"22"}|1 |1 |
+---+-----------+-----+--------------+
when you do df5.dtypes
(CID,StringType), (item,StringType), (value,IntegerType), (marketplace_id,IntegerType)
the column item is of string type, is there a way this can be of json/object type(if that is a thing)?
EDIT 1: I will describe what I am trying to achieve here, the above two steps remains same.
val w = Window.partitionBy("CID").orderBy(desc("value"))
val sorted_list = df5.withColumn("item", collect_list("item").over(w)).groupBy("CID").agg(max("item") as "item")
Output:
+---+-------------------------+
|CID|item |
+---+-------------------------+
|A6 |[{"id":"9"}] |
|A2 |[{"id":"1"}, {"id":"5"}] |
|A7 |[{"id":"7"}, {"id":"4"}] |
|A1 |[{"id":"5"}, {"id":"22"}]|
+---+-------------------------+
now whatever is inside [ ]
is a string. which is causing a problem for one of the tools we are using.
Sorry, pardon me I am new to scala, spark if this is a basic question.
Store json
data using struct
type, check below code.
scala> dfa
.withColumn("item_without_json",struct($"cid".as("id")))
.withColumn("item_as_json",to_json($"item_without_json"))
.show(false)
+---+-----------+-----+--------------+-----------------+------------+
|CID|item |value|marketplace_id|item_without_json|item_as_json|
+---+-----------+-----+--------------+-----------------+------------+
|A1 |{"id":"A1"}|90 |1 |[A1] |{"id":"A1"} |
|A2 |{"id":"A2"}|120 |1 |[A2] |{"id":"A2"} |
|A6 |{"id":"A6"}|450 |1 |[A6] |{"id":"A6"} |
|A7 |{"id":"A7"}|333 |1 |[A7] |{"id":"A7"} |
|A7 |{"id":"A7"}|320 |1 |[A7] |{"id":"A7"} |
|A2 |{"id":"A2"}|22 |1 |[A2] |{"id":"A2"} |
|A1 |{"id":"A1"}|1 |1 |[A1] |{"id":"A1"} |
+---+-----------+-----+--------------+-----------------+------------+
Based on the comment you made to have the dataset converted to json you would use:
df4
.select(collect_list(struct($"CID".as("id"))).as("items"))
.write()
.json(path)
The output will look like:
{"items":[{"id":"A1"},{"id":"A2"},{"id":"A6"},{"id":"A7"}, ...]}
If you need the thing in memory to pass down to a function, instead of write().json(...)
use toJSON
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.