
Joining multiple columns in PySpark

I would like to join two DataFrames that have column names in common.

My DataFrames are as follows:

>>> sample3
DataFrame[uid1: string, count1: bigint]
>>> sample4
DataFrame[uid1: string, count1: bigint]


sample3
     uid1  count1
0    John       3
1    Paul       4
2  George       5

sample4
     uid1  count1
0    John       3
1    Paul       4
2  George       5

(I am using the same DataFrame with a different name on purpose)

I looked at Spark JIRA issue SPARK-7197, which addresses how to perform this join (the approach differs from the one shown in the PySpark documentation). However, the method it proposes produces duplicate columns:

>>> cond = (sample3.uid1 == sample4.uid1) & (sample3.count1 == sample4.count1)
>>> sample3.join(sample4, cond)
DataFrame[uid1: string, count1: bigint, uid1: string, count1: bigint]

I would like to get a result where the keys do not appear twice.

I can do this with one column:

>>> sample3.join(sample4, 'uid1')
DataFrame[uid1: string, count1: bigint, count1: bigint]

However, the same syntax does not work with the condition-based join above and throws an error.

I would like to get the result:

DataFrame[uid1: string, count1: bigint]

How can this be done?

You can define the join condition using a list of key column names; in your case:

sample3.join(sample4, ['uid1','count1'])
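
For completeness, here is a minimal, self-contained sketch; the SparkSession setup and the literal data are illustrative assumptions, not part of your original code:

from pyspark.sql import SparkSession

# Illustrative session; any existing SparkSession works the same way.
spark = SparkSession.builder.appName("multi-key-join").getOrCreate()

sample3 = spark.createDataFrame(
    [("John", 3), ("Paul", 4), ("George", 5)], ["uid1", "count1"])
sample4 = spark.createDataFrame(
    [("John", 3), ("Paul", 4), ("George", 5)], ["uid1", "count1"])

# Passing a list of column names makes Spark perform an equi-join on those
# keys and keep only one copy of each key column in the result.
joined = sample3.join(sample4, ["uid1", "count1"])
print(joined)   # DataFrame[uid1: string, count1: bigint]
joined.show()

If you really need the explicit condition form (for example, for a non-equality join), a common alternative is to drop the right-hand duplicates afterwards, e.g. sample3.join(sample4, cond).drop(sample4.uid1).drop(sample4.count1).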
