PySpark Self Join without alias
I have a DF that I want to left_outer join with itself, but I would like to do it with the pyspark api rather than aliases.
So it is something like:
df = ...
df2 = df
df.join(df2, [df['SomeCol'] == df2['SomeOtherCol']], how='left_outer')
Interestingly, this is incorrect. When I run it I get this error:
WARN Column: Constructing trivially true equals predicate, 'CAMPAIGN_ID#62L = CAMPAIGN_ID#62L'. Perhaps you need to use aliases.
Is there a way to do this without using an alias? Or a clean way with aliases? Aliases really make the code a lot dirtier compared to using the pyspark api directly.
The cleanest way of using aliases is as follows.
Given the following DataFrame:
df.show()
+---+----+---+
| ID|NAME|AGE|
+---+----+---+
| 1|John| 50|
| 2|Anna| 32|
| 3|Josh| 41|
| 4|Paul| 98|
+---+----+---+
In the following example, I simply append "2" to each of the column names so that each column has a unique name after the join.
df2 = df.select([functions.col(c).alias(c + "2") for c in df.columns])
df = df.join(df2, on = df['NAME'] == df2['NAME2'], how='left_outer')
df.show()
+---+----+---+---+-----+----+
| ID|NAME|AGE|ID2|NAME2|AGE2|
+---+----+---+---+-----+----+
| 1|John| 50| 1| John| 50|
| 2|Anna| 32| 2| Anna| 32|
| 3|Josh| 41| 3| Josh| 41|
| 4|Paul| 98| 4| Paul| 98|
+---+----+---+---+-----+----+
If I had simply done a df.join(df).select("NAME"), pyspark would not know which column I want to select, as both copies have the exact same name. This leads to errors like the following:
AnalysisException: Reference 'NAME' is ambiguous, could be: NAME, NAME.