I have two tables (DataFrames), and I want to use one to update the other. I know that Spark SQL does not support a statement such as update a set a.1 = b.1 from b where a.2 = b.2 and a.update < b.update. Please suggest how I can achieve the same result in Spark.
table1
+------+----+------+
|number|name|update|
+------+----+------+
|     1|   a| 08-01|
|     2|   b| 08-02|
+------+----+------+
table2
+------+----+------+
|number|name|update|
+------+----+------+
|     1|  a2| 08-03|
|     3|   b| 08-02|
+------+----+------+
I want to get this:
+------+----+------+
|number|name|update|
+------+----+------+
|     1|  a2| 08-03|
|     2|   b| 08-02|
|     3|   b| 08-02|
+------+----+------+
Is there any other way to do this in Spark?
Using PySpark, you could use subtract() to find the number values of table1 that are not present in table2, and then use unionAll on the two tables, with table1 filtered down to the rows that are missing from table2.
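# number values that appear in table1 but not in table2
diff = (table1.select('number')
        .subtract(table2.select('number'))
        .rdd.map(lambda x: x[0]).collect())
# keep every table2 row, plus the table1 rows whose number is missing from table2
table2.unionAll(table1[table1.number.isin(diff)]).orderBy('number').show()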
+------+----+------+
|number|name|update|
+------+----+------+
|     1|  a2| 08-03|
|     2|   b| 08-02|
|     3|   b| 08-02|
+------+----+------+