Joining on multiple columns in PySpark

Question

I want to join two DataFrames that have the same column names.

My DataFrames are as follows:

>>> sample3
DataFrame[uid1: string, count1: bigint]
>>> sample4
DataFrame[uid1: string, count1: bigint]


sample3
     uid1  count1
0  John         3
1  Paul         4
2  George       5

sample4
     uid1  count1
0  John         3
1  Paul         4
2  George       5

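(I deliberately used the same DataFrame under two different names.)

I looked at Spark JIRA issue 7197, which addresses how to perform this join (it is inconsistent with the PySpark documentation). However, the method suggested there produces duplicate columns: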

>>> cond = (sample3.uid1 == sample4.uid1) & (sample3.count1 == sample4.count1)
>>> sample3.join(sample4, cond)
DataFrame[uid1: string, count1: bigint, uid1: string, count1: bigint]

I want a result in which the key columns do not appear twice.

I can do this with a single column:

>>> sample3.join(sample4, 'uid1')
DataFrame[uid1: string, count1: bigint, count1: bigint]

However, the same syntax does not work with the multi-column join above and raises an error.

The result I want is:

DataFrame[uid1: string, count1: bigint]

How can I achieve this?

Answer 1

You can define the join condition with a list of key column names:

sample3.join(sample4, ['uid1', 'count1'])

Solution 1
0 2017-08-24 02:23:01