How to join two datasets in Scala?

I have two datasets:

itemname       itemId       coupons
A               1            true
A               2            false


itemname      purchases
B               10
A               10
C               10

I need to get

itemname   itemId   coupons  purchases
A             1       true      10
A             2       false     10

I'm doing:

 val mm = items.join(purchases, items("itemname") === purchases("itemname")).drop(items("itemname"))

Is this the correct way to do this in Spark with Scala?

This code:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Schemas for the two inputs
val itemsSchema = List(
  StructField("itemname", StringType, nullable = false),
  StructField("itemid", IntegerType, nullable = false),
  StructField("coupons", BooleanType, nullable = false))

val purchasesSchema = List(
  StructField("itemname", StringType, nullable = false),
  StructField("purchases", IntegerType, nullable = false))

// Sample rows matching the data in the question
val items = Seq(Row("A", 1, true), Row("A", 2, false))
val purchases = Seq(Row("A", 10), Row("B", 10), Row("C", 10))

val itemsDF = spark.createDataFrame(
  spark.sparkContext.parallelize(items),
  StructType(itemsSchema)
)

val purchasesDF = spark.createDataFrame(
  spark.sparkContext.parallelize(purchases),
  StructType(purchasesSchema)
)

// Inner join on the shared column name; the result keeps a single itemname column
purchasesDF.join(itemsDF, Seq("itemname")).show(false)

gives:

+--------+---------+------+-------+
|itemname|purchases|itemid|coupons|
+--------+---------+------+-------+
|A       |10       |1     |true   |
|A       |10       |2     |false  |
+--------+---------+------+-------+
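If the column order from the question matters (itemname, itemid, coupons, purchases), putting the items DataFrame on the left side of the join should give it. The sketch below also compares the name-based join with the expression-based join from the question. It is written spark-shell style and is only an illustration: the local SparkSession, the app name, and the names byName and byExpr are my own assumptions, not part of the original code.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("join-sketch") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Same sample data as above, built with toDF for brevity
val itemsDF = Seq(("A", 1, true), ("A", 2, false))
  .toDF("itemname", "itemid", "coupons")
val purchasesDF = Seq(("A", 10), ("B", 10), ("C", 10))
  .toDF("itemname", "purchases")

// Join on the column name: one itemname column, left-side columns come first
val byName = itemsDF.join(purchasesDF, Seq("itemname"))
println(byName.columns.mkString(", "))

// Join on an expression (the approach from the question): both itemname
// columns survive the join, so one of them has to be dropped explicitly
val byExpr = itemsDF
  .join(purchasesDF, itemsDF("itemname") === purchasesDF("itemname"))
  .drop(purchasesDF("itemname"))
println(byExpr.columns.mkString(", "))
```

Both variants are inner joins, so the B and C purchases are left out because they have no matching item. The name-based form just saves the extra drop call.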

Hope this helps.
