I have the following code:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val baseDF = sqlContext.read.json(fileFullPath)
My JSON has two fields of interest, ProductId and Quantity. Here is the input:
{
"sales": {
"saledate": "17Mar2008",
"sale": [{
"productid": 1,
"quantity": 10
}, {
"productid": 2,
"quantity": 1
}, {
"productid": 3,
"quantity": 3
}, {
"productid": 4,
"quantity": 5
}]
}
}
I want to transform this into a Spark RDD or DataFrame with 2 columns, productid and quantity, but with one row per unit of quantity.
In the above example, product 1 gets 10 rows, product 2 gets 1, product 3 gets 3, and product 4 gets 5, for a total of 19 rows, i.e. # rows = sum(quantity).
Any help appreciated. I am using Spark 1.6.2 and Scala.
This should do the trick:
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
val baseDF = sqlContext.read.json(fileFullPath)
val listFromQuantity = udf { quantity: Int => List.fill(quantity)(quantity) }
baseDF
  .select(explode($"sales.sale"))
  // cast: JSON integers are inferred as LongType, so cast before applying the Int UDF;
  // alias the exploded column so it shows up as "quantity" rather than the default "col"
  .select($"col.productId", explode(listFromQuantity($"col.quantity".cast("int"))).as("quantity"))
  .show()
Which returns:
+---------+--------+
|productId|quantity|
+---------+--------+
| 1| 10|
| 1| 10|
| 1| 10|
| 1| 10|
| 1| 10|
| 1| 10|
| 1| 10|
| 1| 10|
| 1| 10|
| 1| 10|
| 2| 1|
| 3| 3|
| 3| 3|
| 3| 3|
| 4| 5|
| 4| 5|
| 4| 5|
| 4| 5|
| 4| 5|
+---------+--------+
If you'd rather have a constant 1 in the second column (e.g. the value 1 instead of 5), replace List.fill(quantity)(quantity) with List.fill(quantity)(1).
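The expansion logic is easy to check in plain Scala without a Spark cluster: List.fill(q)(x) builds q copies of x, and explode then turns each copy into its own row, which a flatMap can mimic. A minimal sketch, using a hypothetical sample that matches the JSON above:

```scala
object FillDemo {
  // Hypothetical (productid, quantity) pairs matching the sample JSON
  val sales: List[(Int, Int)] = List((1, 10), (2, 1), (3, 3), (4, 5))

  // Mimic select(explode(listFromQuantity($"quantity"))) with flatMap:
  // each pair expands into `quantity` rows of (productid, 1)
  val rows: List[(Int, Int)] =
    sales.flatMap { case (productId, quantity) =>
      List.fill(quantity)(1).map(one => (productId, one))
    }

  def main(args: Array[String]): Unit = {
    assert(rows.length == 19)           // total rows = sum(quantity)
    assert(rows.count(_._1 == 1) == 10) // product 1 expands to 10 rows
    println(rows.take(3))
  }
}
```

Swapping List.fill(quantity)(1) for List.fill(quantity)(quantity) reproduces the repeated-quantity output shown above.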