
Joining multiple columns in PySpark

I would like to join two DataFrames that have column names in common.

My DataFrames are as follows:

>>> sample3
DataFrame[uid1: string, count1: bigint]
>>> sample4
DataFrame[uid1: string, count1: bigint]


sample3
     uid1  count1
0    John       3
1    Paul       4
2  George       5

sample4
     uid1  count1
0    John       3
1    Paul       4
2  George       5

(I am using the same DataFrame with a different name on purpose.)

I looked at Spark JIRA issue 7197, which addresses how to perform this join (in a way that is inconsistent with the PySpark documentation). However, the method it proposes produces duplicate columns:

>>> cond = (sample3.uid1 == sample4.uid1) & (sample3.count1 == sample4.count1)
>>> sample3.join(sample4, cond)
DataFrame[uid1: string, count1: bigint, uid1: string, count1: bigint]
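
One workaround would be to drop the duplicated key columns from one side after the condition-based join (a sketch; I am assuming a Spark version whose drop() accepts a Column object), but I would prefer a join that never produces the duplicates:

>>> cond = (sample3.uid1 == sample4.uid1) & (sample3.count1 == sample4.count1)
>>> # drop() removes one column per call, so chain a call for each duplicated key
>>> sample3.join(sample4, cond).drop(sample4.uid1).drop(sample4.count1)
DataFrame[uid1: string, count1: bigint]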

I would like to get a result where the keys do not appear twice.

I can do this with one column:

>>> sample3.join(sample4, 'uid1')
DataFrame[uid1: string, count1: bigint, count1: bigint]

However, the same syntax does not apply to this method of joining and throws an error.

I would like to get the result:

DataFrame[uid1: string, count1: bigint]

I was wondering how this would be possible.

You can define the join condition with a list of key columns; in your case:

sample3.join(sample4, ['uid1', 'count1'])
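
For a self-contained illustration, here is a sketch that rebuilds the question's DataFrames and performs the list-based join (it assumes Spark 2.x or later for SparkSession):

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [("John", 3), ("Paul", 4), ("George", 5)]
>>> sample3 = spark.createDataFrame(rows, ["uid1", "count1"])
>>> sample4 = spark.createDataFrame(rows, ["uid1", "count1"])
>>> # A list of column names joins on all of them and keeps a single copy of each
>>> sample3.join(sample4, ['uid1', 'count1'])
DataFrame[uid1: string, count1: bigint]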
