Searching and updating a Spark Dataset column with values from another Dataset
Java 8 and Spark 2.3.2 (Scala 2.11) here. Although I would greatly prefer Java API answers, I do speak a wee bit of Scala, so I will be able to understand any answers provided in it! But Java if at all possible (please)!
I have two Datasets with different schemas, with the exception of a common "model_number" (string) column that exists on both.
For each row in my first Dataset (we'll call that d1), I need to scan/search the second Dataset ("d2") to see if there is a row with the same model_number, and if so, update another d2 column.
Here are my Dataset schemas:
d1
===========
model_number : string
desc : string
fizz : string
buzz : date
d2
===========
model_number : string
price : double
source : string
So again, if a d1 row has a model_number of, say, 12345, and a d2 row also has the same model_number, I want to update d2.price by multiplying it by 10.0.
My best attempt thus far:
// I *think* this would give me a 3rd dataset with all d1 and d2 columns, but only
// containing rows from d1 and d2 that have matching 'model_number' values.
// Note: Java has no operator overloading, so Column equality must be expressed
// with equalTo(...), not ==
Dataset<Row> d3 = d1.join(d2, d1.col("model_number").equalTo(d2.col("model_number")));
// now I just need to update d2.price based on matching; Column arithmetic in the
// Java API goes through multiply(...), not *
Dataset<Row> d4 = d3.withColumn("adjusted_price", d3.col("price").multiply(10.0));
Can anyone help me cross the finish line here? Thanks in advance!
Some points here. As @VamsiPrabhala mentioned in the comment, the function you need is join on your specific fields. Regarding the "update": keep in mind that DataFrames, Datasets and RDDs in Spark are immutable, so you cannot update them. So the solution here is: after joining your DataFrames, perform your calculation (in this case the multiplication) in a select, or using withColumn and then select. In other words, you cannot update the column, but you can create a new DataFrame with the "new" column.
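The "no in-place update" idea can be illustrated outside Spark with plain Java collections (a purely illustrative sketch; the maps and the adjustPrices name below are hypothetical stand-ins for the two Datasets, not Spark API): inner-join on the shared key and build a new structure instead of mutating either input.

```java
import java.util.*;

public class JoinSketch {
    // Inner-join d1's keys against d2's prices, deriving a *new* map
    // holding price * 10.0. Neither input is modified, mirroring how a
    // Spark join + withColumn produces a new Dataset.
    static Map<String, Double> adjustPrices(Map<String, String> d1,
                                            Map<String, Double> d2) {
        Map<String, Double> adjusted = new LinkedHashMap<>();
        for (String model : d1.keySet()) {
            Double price = d2.get(model);
            if (price != null) {          // only rows whose model_number matches
                adjusted.put(model, price * 10.0);
            }
        }
        return adjusted;
    }

    public static void main(String[] args) {
        Map<String, String> d1 = new LinkedHashMap<>();
        d1.put("model_a", "desc_a");
        d1.put("model_b", "desc_b");

        Map<String, Double> d2 = new LinkedHashMap<>();
        d2.put("model_a", 10.0);
        d2.put("model_b", 20.0);

        System.out.println(adjustPrices(d1, d2)); // {model_a=100.0, model_b=200.0}
    }
}
```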
Example:
Input data:
+------------+------+------+----+
|model_number|  desc|  fizz|buzz|
+------------+------+------+----+
|     model_a|desc_a|fizz_a|null|
|     model_b|desc_b|fizz_b|null|
+------------+------+------+----+
+------------+-----+--------+
|model_number|price|  source|
+------------+-----+--------+
|     model_a| 10.0|source_a|
|     model_b| 20.0|source_b|
+------------+-----+--------+
Using join will output:
import org.apache.spark.sql.functions.col

val joinedDF = d1.join(d2, "model_number")
joinedDF.show()
+------------+------+------+----+-----+--------+
|model_number|  desc|  fizz|buzz|price|  source|
+------------+------+------+----+-----+--------+
|     model_a|desc_a|fizz_a|null| 10.0|source_a|
|     model_b|desc_b|fizz_b|null| 20.0|source_b|
+------------+------+------+----+-----+--------+
Applying your calculation:
joinedDF.withColumn("price", col("price") * 10).show()
output:
+------------+------+------+----+-----+--------+
|model_number|  desc|  fizz|buzz|price|  source|
+------------+------+------+----+-----+--------+
|     model_a|desc_a|fizz_a|null|100.0|source_a|
|     model_b|desc_b|fizz_b|null|200.0|source_b|
+------------+------+------+----+-----+--------+
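Since the question asks for the Java API: the same two steps look roughly like this (an untested sketch of the equivalent calls; note that Column arithmetic goes through method calls such as multiply, because Java has no operator overloading). This fragment assumes d1 and d2 already exist as Dataset<Row> inside a running SparkSession, so it is not runnable on its own:

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// USING-style join on the shared column name keeps a single
// model_number column in the result.
Dataset<Row> joinedDF = d1.join(d2, "model_number");

// Derive the "new" price; this returns a new Dataset and leaves
// the inputs untouched.
Dataset<Row> result = joinedDF.withColumn("price", col("price").multiply(10.0));
result.show();
```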