
Accessing multiple datasets in Spark

I have a use case where I want to use a value from another dataset. For example:

Table 1: Items

Name  | Price
-------------
Apple | 10
Mango | 20
Grape | 30

Table 2: Item_Quantity

Name  | Quantity
----------------
Apple | 5
Mango | 2
Grape | 2

I want to calculate the total cost and prepare a final dataset.

Cost

Name  | Cost
------------
Apple | 50  (10*5)
Mango | 40  (20*2)
Grape | 60  (30*2)

How can I achieve this in Spark? I appreciate your help.

===================

Another use case:

I need help with this one too.

Table 1: Items

Name    | Code | Quantity
-------------------------
Apple-1 | APP  | 10
Mango-1 | MAN  | 20
Grape-1 | GRA  | 30
Apple-2 | APP  | 20
Mango-2 | MAN  | 30
Grape-2 | GRA  | 50


Table 2: Item_CODE_Price

Code | Price
------------
APP  | 5
MAN  | 2
GRA  | 2

I want to calculate total cost using code to get the price and prepare a final dataset.

Cost

Name    | Cost
--------------
Apple-1 | 50   (10*5)
Mango-1 | 40   (20*2)
Grape-1 | 60   (30*2)
Apple-2 | 100  (20*5)
Mango-2 | 60   (30*2)
Grape-2 | 100  (50*2)

You can join the two tables on their shared Name column and create a new column with withColumn, as below:

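  // Assumes `spark` is an existing SparkSession (spark-shell provides one)
  import spark.implicits._  // needed for toDF and the $"col" syntax

  // Table 1: price per item
  val df1 = spark.sparkContext.parallelize(Seq(
    ("Apple", 10),
    ("Mango", 20),
    ("Grape", 30)
  )).toDF("Name", "Price")

  // Table 2: quantity per item
  val df2 = spark.sparkContext.parallelize(Seq(
    ("Apple", 5),
    ("Mango", 2),
    ("Grape", 2)
  )).toDF("Name", "Quantity")

  // Inner join on Name, then add Cost = Price * Quantity
  val newDF = df1.join(df2, Seq("Name"))
    .withColumn("Cost", $"Price" * $"Quantity")

  // Print the result without truncating cell values
  newDF.show(false)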

Output:

+-----+-----+--------+----+
|Name |Price|Quantity|Cost|
+-----+-----+--------+----+
|Grape|30   |2       |60  |
|Mango|20   |2       |40  |
|Apple|10   |5       |50  |
+-----+-----+--------+----+

For the second case, you just need to join on Code and drop the columns you don't want in the final dataset:

  // Here df1 is Items (Name, Code, Quantity) and df2 is Item_CODE_Price (Code, Price)
  val newDF = df2.join(df1, Seq("Code"))
    .withColumn("Cost", $"Price" * $"Quantity")
    .drop("Code", "Price", "Quantity")
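For reference, here is a minimal end-to-end sketch of the second case that can be pasted into spark-shell or a notebook. The variable names `items`, `prices` and `cost`, the app name, and the `local[*]` master are illustrative assumptions, not part of the question; the data is copied from the two tables in the question, with "Grape -2" written as "Grape-2". It selects Name and Cost at the end rather than dropping the other columns, which gives the same two-column result.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; in spark-shell,
// getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cost-by-code")
  .getOrCreate()
import spark.implicits._

// Table 1: Items (Name, Code, Quantity)
val items = Seq(
  ("Apple-1", "APP", 10),
  ("Mango-1", "MAN", 20),
  ("Grape-1", "GRA", 30),
  ("Apple-2", "APP", 20),
  ("Mango-2", "MAN", 30),
  ("Grape-2", "GRA", 50)
).toDF("Name", "Code", "Quantity")

// Table 2: Item_CODE_Price (Code, Price)
val prices = Seq(
  ("APP", 5),
  ("MAN", 2),
  ("GRA", 2)
).toDF("Code", "Price")

// Inner join on Code, compute Cost = Quantity * Price,
// and keep only the two columns the final dataset needs.
val cost = items.join(prices, Seq("Code"))
  .withColumn("Cost", $"Quantity" * $"Price")
  .select("Name", "Cost")

cost.show(false)
```

Note that this is an inner join, so an item whose Code has no entry in the price table would be left out of the result. Row order in the printed output is not guaranteed unless you add an explicit orderBy.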

This example is in Scala; there won't be much difference if you need it in Java.

Hope this helps!
