
Searching and updating a Spark Dataset column with values from another Dataset

Java 8 and Spark 2.3.2 (Scala 2.11 build) here. Although I would greatly prefer Java API answers, I do speak a wee bit of Scala, so I will be able to understand any answers provided in it! But Java if at all possible (please)!

I have two Datasets with different schemas, with the exception of a common model_number (string) column that exists in both.

For each row in my first Dataset (we'll call it d1), I need to scan/search the second Dataset (d2) to see if there is a row with the same model_number, and if so, update another d2 column.

Here are my Dataset schemas:

d1
===========
model_number : string
desc : string
fizz : string
buzz : date

d2
===========
model_number : string
price : double
source : string

So again, if a d1 row has a model_number of, say, 12345, and a d2 row also has the same model_number, I want to update d2.price by multiplying it by 10.0.

My best attempt thus far:

// I *think* this would give me a 3rd dataset with all d1 and d2 columns, but only
// containing rows from d1 and d2 that have matching 'model_number' values
Dataset<Row> d3 = d1.join(d2, d1.col("model_number").equalTo(d2.col("model_number")));

// now I just need to update d2.price based on matching
Dataset<Row> d4 = d3.withColumn("adjusted_price", d3.col("price").multiply(10.0));

Can anyone help me cross the finish line here? Thanks in advance!

Some points here: as @VamsiPrabhala mentioned in the comment, the function you need to use is join on your specific field. Regarding the "update", keep in mind that df, ds and rdd in Spark are immutable, so you cannot update them. So the solution here is, after joining your df's, to perform your calculation, in this case the multiplication, either in a select or using withColumn followed by a select. In other words, you cannot update the column, but you can create a new df with the "new" column.
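For the Java API the question asks for, the select variant of that idea might look roughly like this. This is only a sketch: joined is assumed to be the result of joining d1 and d2 on model_number, and a static import of org.apache.spark.sql.functions.col is assumed.

// recompute price inline while projecting the joined columns
Dataset<Row> result = joined.select(
        col("model_number"), col("desc"), col("fizz"), col("buzz"),
        col("price").multiply(10.0).as("price"), col("source"));

The Scala walkthrough below takes the withColumn route instead.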

Example:

Input data:

+------------+------+------+----+
|model_number|  desc|  fizz|buzz|
+------------+------+------+----+
|     model_a|desc_a|fizz_a|null|
|     model_b|desc_b|fizz_b|null|
+------------+------+------+----+

+------------+-----+--------+
|model_number|price|  source|
+------------+-----+--------+
|     model_a| 10.0|source_a|
|     model_b| 20.0|source_b|
+------------+-----+--------+

Using join will output:

val joinedDF = d1.join(d2, "model_number")
joinedDF.show()

+------------+------+------+----+-----+--------+
|model_number|  desc|  fizz|buzz|price|  source|
+------------+------+------+----+-----+--------+
|     model_a|desc_a|fizz_a|null| 10.0|source_a|
|     model_b|desc_b|fizz_b|null| 20.0|source_b|
+------------+------+------+----+-----+--------+

Applying your calculation:

joinedDF.withColumn("price", col("price") * 10).show()

Output:

+------------+------+------+----+-----+--------+
|model_number|  desc|  fizz|buzz|price|  source|
+------------+------+------+----+-----+--------+
|     model_a|desc_a|fizz_a|null|100.0|source_a|
|     model_b|desc_b|fizz_b|null|200.0|source_b|
+------------+------+------+----+-----+--------+
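Since the question asks for the Java API specifically, here is a minimal, self-contained sketch of the same approach. The class name and the in-memory construction of the sample data are mine, not from the answer; only the values mirror the input tables above.

import static org.apache.spark.sql.functions.col;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class AdjustPrices {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("adjust-prices")
                .master("local[*]")
                .getOrCreate();

        // Sample data mirroring the answer's input tables
        StructType d1Schema = new StructType()
                .add("model_number", DataTypes.StringType)
                .add("desc", DataTypes.StringType)
                .add("fizz", DataTypes.StringType)
                .add("buzz", DataTypes.DateType);
        List<Row> d1Rows = Arrays.asList(
                RowFactory.create("model_a", "desc_a", "fizz_a", null),
                RowFactory.create("model_b", "desc_b", "fizz_b", null));
        Dataset<Row> d1 = spark.createDataFrame(d1Rows, d1Schema);

        StructType d2Schema = new StructType()
                .add("model_number", DataTypes.StringType)
                .add("price", DataTypes.DoubleType)
                .add("source", DataTypes.StringType);
        List<Row> d2Rows = Arrays.asList(
                RowFactory.create("model_a", 10.0, "source_a"),
                RowFactory.create("model_b", 20.0, "source_b"));
        Dataset<Row> d2 = spark.createDataFrame(d2Rows, d2Schema);

        // Inner join on the shared column; only rows with a matching model_number survive
        Dataset<Row> joined = d1.join(d2, "model_number");

        // Datasets are immutable, so "updating" price means building a new
        // Dataset whose price column is the old value multiplied by 10.0
        Dataset<Row> adjusted = joined.withColumn("price", col("price").multiply(10.0));

        // Optionally project back to d2's original three columns
        adjusted.select("model_number", "price", "source").show();

        spark.stop();
    }
}

The trailing select is optional; it is only there to bring the result back to d2's original shape of model_number, price and source.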
