
Joining multiple columns in PySpark

I would like to join two DataFrames that have column names in common.

My DataFrames are as follows:

>>> sample3
DataFrame[uid1: string, count1: bigint]
>>> sample4
DataFrame[uid1: string, count1: bigint]


sample3
     uid1  count1
0    John       3
1    Paul       4
2  George       5

sample4
     uid1  count1
0    John       3
1    Paul       4
2  George       5

(I am using the same DataFrame with a different name on purpose)

I looked at Spark JIRA issue SPARK-7197, which addresses how to perform this join (the approach differs from the one shown in the PySpark documentation). However, the method it proposes produces duplicate columns:

>>> cond = (sample3.uid1 == sample4.uid1) & (sample3.count1 == sample4.count1)
>>> sample3.join(sample4, cond)
DataFrame[uid1: string, count1: bigint, uid1: string, count1: bigint]

I would like to get a result where the keys do not appear twice.

I can do this with one column:

>>> sample3.join(sample4, 'uid1')
DataFrame[uid1: string, count1: bigint, count1: bigint]

However, the same syntax does not work with the condition-based join above and throws an error.

I would like to get the result:

DataFrame[uid1: string, count1: bigint]

How can this be done?

You can define the join condition using a list of key column names; in your case:

sample3.join(sample4, ['uid1','count1'])
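
For completeness, here is a minimal, self-contained sketch; the SparkSession setup and the literal data are illustrative assumptions, not part of your original code:

from pyspark.sql import SparkSession

# Illustrative session; any existing SparkSession works the same way.
spark = SparkSession.builder.appName("multi-key-join").getOrCreate()

sample3 = spark.createDataFrame(
    [("John", 3), ("Paul", 4), ("George", 5)], ["uid1", "count1"])
sample4 = spark.createDataFrame(
    [("John", 3), ("Paul", 4), ("George", 5)], ["uid1", "count1"])

# Passing a list of column names makes Spark perform an equi-join on those
# keys and keep only one copy of each key column in the result.
joined = sample3.join(sample4, ["uid1", "count1"])
print(joined)   # DataFrame[uid1: string, count1: bigint]
joined.show()

If you really need the explicit condition form (for example, for a non-equality join), a common alternative is to drop the right-hand duplicates afterwards, e.g. sample3.join(sample4, cond).drop(sample4.uid1).drop(sample4.count1).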
