PySpark Self Join without alias
I have a DF that I want to left_outer join with itself, but I would like to do it with the pyspark api rather than aliases.
So it is something like:
df = ...
df2 = df
df.join(df2, [df['SomeCol'] == df2['SomeOtherCol']], how='left_outer')
Interestingly, this is incorrect. When I run it I get this error:
WARN Column: Constructing trivially true equals predicate, 'CAMPAIGN_ID#62L = CAMPAIGN_ID#62L'. Perhaps you need to use aliases.
Is there a way to do this without using an alias? Or a clean way with aliases? Aliases really make the code a lot dirtier compared to using the pyspark api directly.
The cleanest way of using aliases is as follows.
Given the following DataFrame:
df.show()
+---+----+---+
| ID|NAME|AGE|
+---+----+---+
| 1|John| 50|
| 2|Anna| 32|
| 3|Josh| 41|
| 4|Paul| 98|
+---+----+---+
In the following example, I simply append "2" to each of the column names so that each column has a unique name after the join.
df2 = df.select([functions.col(c).alias(c + "2") for c in df.columns])
df = df.join(df2, on = df['NAME'] == df2['NAME2'], how='left_outer')
df.show()
+---+----+---+---+-----+----+
| ID|NAME|AGE|ID2|NAME2|AGE2|
+---+----+---+---+-----+----+
| 1|John| 50| 1| John| 50|
| 2|Anna| 32| 2| Anna| 32|
| 3|Josh| 41| 3| Josh| 41|
| 4|Paul| 98| 4| Paul| 98|
+---+----+---+---+-----+----+
If I had simply done a df.join(df).select("NAME"), pyspark would not know which column I want to select, as both copies have the exact same name. This leads to errors like the following:
AnalysisException: Reference 'NAME' is ambiguous, could be: NAME, NAME.