PySpark Self Join without alias

I have a DataFrame that I want to left_outer join with itself, but I would like to do it with the PySpark API rather than with aliases.

Something like this:

df = ...
df2 = df

df.join(df2, [df['SomeCol'] == df2['SomeOtherCol']], how='left_outer')

Interestingly, this does not work. When I run it I get this warning:

WARN Column: Constructing trivially true equals predicate, 'CAMPAIGN_ID#62L = CAMPAIGN_ID#62L'. Perhaps you need to use aliases.

Is there a way to do this without using aliases? Or a clean way with aliases? Aliases make the code a lot messier than using the PySpark API directly.

The cleanest way of using aliases is as follows.

Given the following DataFrame:

df.show()
+---+----+---+
| ID|NAME|AGE|
+---+----+---+
|  1|John| 50|
|  2|Anna| 32|
|  3|Josh| 41|
|  4|Paul| 98|
+---+----+---+

In the following example, I simply append "2" to each column name so that every column has a unique name after the join.

from pyspark.sql import functions

df2 = df.select([functions.col(c).alias(c + "2") for c in df.columns])

df = df.join(df2, on = df['NAME'] == df2['NAME2'], how='left_outer')

df.show()
+---+----+---+---+-----+----+
| ID|NAME|AGE|ID2|NAME2|AGE2|
+---+----+---+---+-----+----+
|  1|John| 50|  1| John|  50|
|  2|Anna| 32|  2| Anna|  32|
|  3|Josh| 41|  3| Josh|  41|
|  4|Paul| 98|  4| Paul|  98|
+---+----+---+---+-----+----+

If I simply ran df.join(df).select("NAME"), PySpark would not know which column I wanted to select, because both columns have exactly the same name. This leads to errors like the following.

AnalysisException: Reference 'NAME' is ambiguous, could be: NAME, NAME.
