Update one table using another table in Spark

Question

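I have two tables (or dataframes), and I want to use one to update the other. I know that Spark SQL does not support update a set a.1 = b.1 from b where a.2 = b.2 and a.update < b.update. Please suggest how I can achieve this, since Spark does not allow it directly.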

Table 1

    +------+----+------+
    |number|name|update|
    +------+----+------+
    |     1|   a| 08-01|
    |     2|   b| 08-02|
    +------+----+------+

Table 2

    +------+----+------+
    |number|name|update|
    +------+----+------+
    |     1|  a2| 08-03|
    |     3|   b| 08-02|
    +------+----+------+

I want to get this:

    +------+----+------+
    |number|name|update|
    +------+----+------+
    |     1|  a2| 08-03|
    |     2|   b| 08-02|
    |     3|   b| 08-02|
    +------+----+------+

Is there some other way to do this?

Answer 1

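Using pyspark, you can use subtract() to find the values of number in table1 that are not present in table2, and then unionAll the two tables, with table1 filtered down to the observations missing from table2.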

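    # Collect the values of number that appear in table1 but not in table2
    diff = (table1.select('number')
            .subtract(table2.select('number'))
            .rdd.map(lambda x: x[0]).collect())

    # Keep every row of table2, plus the table1 rows whose number is missing from table2
    # (unionAll is deprecated since Spark 2.0; union() is the equivalent there)
    table2.unionAll(table1[table1.number.isin(diff)]).orderBy('number').show()
    +------+----+------+
    |number|name|update|
    +------+----+------+
    |     1|  a2| 08-03|
    |     2|   b| 08-02|
    |     3|   b| 08-02|
    +------+----+------+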

Solution 1
1 2016-10-11 09:39:09
