简体   繁体   中英

Spark 1.6 scala create data rows

I have the following code.

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val baseDF = sqlContext.read.json(fileFullPath)

My json has 2 fields of interest: ProductId and Quantity. What I am looking for

{
    "sales": {
        "saledate": "17Mar2008",
        "sale": [{
            "productid": 1,
            "quantity": 10
        }, {
            "productid": 2,
            "quantity": 1
        }, {
            "productid": 3,
            "quantity": 3
        }, {
            "productid": 4,
            "quantity": 5
        }]
    }
}

I want to change this to an spark RDD or DF which has 2 columns, productid and quantity but multiple rows based on quantity. I want 1 for each quantity.

In above example product 1 has 10 rows, product 2 has 1, product 3 has 3 and product 4 has 5 rows for a total of 19 rows ie # rows = sum(quantity).

Any help appreciated. I am using spark 1.6.2 and scala.

This should do the thing:

import org.apache.spark.sql.functions._

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._

val baseDF = sqlContext.read.json(fileFullPath)
val listFromQuantity = udf { quantity: Int => List.fill(quantity)(quantity) }

baseDF.select(explode($"sales.sale")).select($"col.productId", explode(listFromQuantity($"col.quantity"))).show()

Which returns:

+---------+--------+
|productId|quantity|
+---------+--------+
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        2|       1|
|        3|       3|
|        3|       3|
|        3|       3|
|        4|       5|
|        4|       5|
|        4|       5|
|        4|       5|
|        4|       5|
+---------+--------+

If you'd like to have a single quantity in the second column (eg have value 1 instead of 5 ) you should replace List.fill(quantity)(quantity) with List.fill(quantity)(1)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM