
Searching and updating a Spark Dataset column with values from another Dataset

Java 8 and Spark 2.3.2 (Scala 2.11 build) here. Although I would greatly prefer Java API answers, I do speak a wee bit of Scala, so I will be able to understand any answers provided in it! But Java if at all possible (please)!

I have two Datasets with different schemas, with the exception of a common model_number (string) column that exists in both.

For each row in my first Dataset (we'll call it d1), I need to scan/search the second Dataset (d2) to see if there is a row with the same model_number, and if so, update another d2 column.

Here are my Dataset schemas:

d1
===========
model_number : string
desc : string
fizz : string
buzz : date

d2
===========
model_number : string
price : double
source : string

So again, if a d1 row has a model_number of, say, 12345, and a d2 row also has the same model_number, I want to update d2.price by multiplying it by 10.0.

My best attempt thus far:

// I *think* this would give me a 3rd dataset with all d1 and d2 columns, but only
// containing rows from d1 and d2 that have matching 'model_number' values
Dataset<Row> d3 = d1.join(d2, d1.col("model_number").equalTo(d2.col("model_number")));

// now I just need to update d2.price based on matching
Dataset<Row> d4 = d3.withColumn("adjusted_price", d3.col("price").multiply(10.0));

Can anyone help me cross the finish line here? Thanks in advance!

Some points here: as @VamsiPrabhala mentioned in the comment, the function you need to use is join on your specific field. Regarding the "update", keep in mind that df, ds and rdd in Spark are immutable, so you cannot update them. So the solution here is, after joining your df's, to perform your calculation, in this case the multiplication, either in a select or using withColumn followed by a select. In other words, you cannot update the column, but you can create a new df with the "new" column.
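For the Java API the question asks for, the select variant of that idea might look roughly like this. This is only a sketch: joined is assumed to be the result of joining d1 and d2 on model_number, and a static import of org.apache.spark.sql.functions.col is assumed.

// recompute price inline while projecting the joined columns
Dataset<Row> result = joined.select(
        col("model_number"), col("desc"), col("fizz"), col("buzz"),
        col("price").multiply(10.0).as("price"), col("source"));

The Scala walkthrough below takes the withColumn route instead.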

Example:

Input data:

+------------+------+------+----+
|model_number|  desc|  fizz|buzz|
+------------+------+------+----+
|     model_a|desc_a|fizz_a|null|
|     model_b|desc_b|fizz_b|null|
+------------+------+------+----+

+------------+-----+--------+
|model_number|price|  source|
+------------+-----+--------+
|     model_a| 10.0|source_a|
|     model_b| 20.0|source_b|
+------------+-----+--------+

Using join will output:

val joinedDF = d1.join(d2, "model_number")
joinedDF.show()

+------------+------+------+----+-----+--------+
|model_number|  desc|  fizz|buzz|price|  source|
+------------+------+------+----+-----+--------+
|     model_a|desc_a|fizz_a|null| 10.0|source_a|
|     model_b|desc_b|fizz_b|null| 20.0|source_b|
+------------+------+------+----+-----+--------+

Applying your calculation:

joinedDF.withColumn("price", col("price") * 10).show()

Output:

+------------+------+------+----+-----+--------+
|model_number|  desc|  fizz|buzz|price|  source|
+------------+------+------+----+-----+--------+
|     model_a|desc_a|fizz_a|null|100.0|source_a|
|     model_b|desc_b|fizz_b|null|200.0|source_b|
+------------+------+------+----+-----+--------+
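Since the question asks for the Java API specifically, here is a minimal, self-contained sketch of the same approach. The class name and the in-memory construction of the sample data are mine, not from the answer; only the values mirror the input tables above.

import static org.apache.spark.sql.functions.col;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class AdjustPrices {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("adjust-prices")
                .master("local[*]")
                .getOrCreate();

        // Sample data mirroring the answer's input tables
        StructType d1Schema = new StructType()
                .add("model_number", DataTypes.StringType)
                .add("desc", DataTypes.StringType)
                .add("fizz", DataTypes.StringType)
                .add("buzz", DataTypes.DateType);
        List<Row> d1Rows = Arrays.asList(
                RowFactory.create("model_a", "desc_a", "fizz_a", null),
                RowFactory.create("model_b", "desc_b", "fizz_b", null));
        Dataset<Row> d1 = spark.createDataFrame(d1Rows, d1Schema);

        StructType d2Schema = new StructType()
                .add("model_number", DataTypes.StringType)
                .add("price", DataTypes.DoubleType)
                .add("source", DataTypes.StringType);
        List<Row> d2Rows = Arrays.asList(
                RowFactory.create("model_a", 10.0, "source_a"),
                RowFactory.create("model_b", 20.0, "source_b"));
        Dataset<Row> d2 = spark.createDataFrame(d2Rows, d2Schema);

        // Inner join on the shared column; only rows with a matching model_number survive
        Dataset<Row> joined = d1.join(d2, "model_number");

        // Datasets are immutable, so "updating" price means building a new
        // Dataset whose price column is the old value multiplied by 10.0
        Dataset<Row> adjusted = joined.withColumn("price", col("price").multiply(10.0));

        // Optionally project back to d2's original three columns
        adjusted.select("model_number", "price", "source").show();

        spark.stop();
    }
}

The trailing select is optional; it is only there to bring the result back to d2's original shape of model_number, price and source.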
