How to join two datasets in Scala?

I have two data sets:

itemname       itemId       coupons
A               1            true
A               2            false


itemname      purchases
B               10
A               10
C               10

I need to get:

itemname   itemId   coupons  purchases
A             1       true      10
A             2       false     10

I'm doing:

 val mm = items.join(purchases, items("itemname") === purchases("itemname")).drop(items("itemname"))

Is this the correct way of doing this in Spark Scala?

This code:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val itemsSchema =  List(
  StructField("itemname", StringType, nullable = false),
  StructField("itemid", IntegerType, nullable = false),
  StructField("coupons", BooleanType, nullable = false))

val purchasesSchema =  List(
  StructField("itemname", StringType, nullable = false),
  StructField("purchases", IntegerType, nullable = false))


val items = Seq(Row("A", 1, true), Row("A", 2, false))
val purchases = Seq(Row("A", 10), Row("B", 10), Row("C", 10))

val itemsDF = spark.createDataFrame(
  spark.sparkContext.parallelize(items),
  StructType(itemsSchema)
)

val purchasesDF = spark.createDataFrame(
  spark.sparkContext.parallelize(purchases),
  StructType(purchasesSchema)
)

purchasesDF.join(itemsDF, Seq("itemname")).show(false)

gives:

+--------+---------+------+-------+
|itemname|purchases|itemid|coupons|
+--------+---------+------+-------+
|A       |10       |1     |true   |
|A       |10       |2     |false  |
+--------+---------+------+-------+
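
To compare this with the join from the question, here is a minimal sketch, assuming a local SparkSession (in spark-shell, `spark` already exists) and the same sample rows built with toDF instead of explicit schemas. The names byName and byExpr are made up for illustration. Starting the join from the items side puts the columns in the order the question asks for. The expression-based join also works, but it keeps both itemname columns until one is dropped, and dropping the items copy (as in the question) leaves itemname after coupons rather than first.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("join-sketch")
  .getOrCreate()
import spark.implicits._

val itemsDF = Seq(("A", 1, true), ("A", 2, false))
  .toDF("itemname", "itemid", "coupons")
val purchasesDF = Seq(("A", 10), ("B", 10), ("C", 10))
  .toDF("itemname", "purchases")

// Join on the column name: the result has a single itemname column.
// Columns: itemname, itemid, coupons, purchases
val byName = itemsDF.join(purchasesDF, Seq("itemname"))

// Join on an expression: both itemname columns survive the join,
// so the purchases copy is dropped to get the same columns as above.
val byExpr = itemsDF
  .join(purchasesDF, itemsDF("itemname") === purchasesDF("itemname"))
  .drop(purchasesDF("itemname"))

byName.show(false)
byExpr.show(false)
```

Both are inner joins by default, so the B and C purchases, which have no matching item, do not appear in either result.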

Hope this helps.
