I would like to join two DataFrames that have column names in common.
My DataFrames are as follows:
>>> sample3
DataFrame[uid1: string, count1: bigint]
>>> sample4
DataFrame[uid1: string, count1: bigint]
sample3
uid1 count1
0 John 3
1 Paul 4
2 George 5
sample4
uid1 count1
0 John 3
1 Paul 4
2 George 5
(I am using the same DataFrame with a different name on purpose)
I looked at Spark JIRA issue 7197, which addresses how to perform this join (it is inconsistent with the PySpark documentation). However, the method proposed there produces duplicate columns:
>>> cond = (sample3.uid1 == sample4.uid1) & (sample3.count1 == sample4.count1)
>>> sample3.join(sample4, cond)
DataFrame[uid1: string, count1: bigint, uid1: string, count1: bigint]
I would like to get a result where the keys do not appear twice.
I can do this with one column:
>>> sample3.join(sample4, 'uid1')
DataFrame[uid1: string, count1: bigint, count1: bigint]
However, this single-column shorthand does not carry over to the condition-based join above; passing the condition and a string together throws an error.
I would like to get the result:
DataFrame[uid1: string, count1: bigint]
How can I achieve this?
You can define the join condition using a list of key column names; in your case:
sample3.join(sample4, ['uid1','count1'])