
PySpark Self Join without alias

I have a DataFrame that I want to left_outer join with itself, but I would like to do it with the PySpark API directly rather than with aliases.

So it is something like:

df = ...
df2 = df

df.join(df2, [df['SomeCol'] == df2['SomeOtherCol']], how='left_outer')

Interestingly, this does not work. When I run it I get this warning:

WARN Column: Constructing trivially true equals predicate, 'CAMPAIGN_ID#62L = CAMPAIGN_ID#62L'. Perhaps you need to use aliases.

Is there a way to do this without using aliases? Or a clean way with aliases? Using aliases makes the code a lot dirtier than using the PySpark API directly.
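
For reference, the alias-based version I am trying to avoid looks roughly like this (a sketch; "a" and "b" are arbitrary alias names, and the column names are the placeholders from above):

from pyspark.sql import functions as F

# Aliasing both sides gives each column reference a distinct lineage,
# so the join condition is no longer trivially true.
joined = df.alias("a").join(
    df.alias("b"),
    F.col("a.SomeCol") == F.col("b.SomeOtherCol"),
    how="left_outer",
)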

The cleanest way to use aliases is as follows.

Given the following DataFrame:

df.show()
+---+----+---+
| ID|NAME|AGE|
+---+----+---+
|  1|John| 50|
|  2|Anna| 32|
|  3|Josh| 41|
|  4|Paul| 98|
+---+----+---+

In the following example, I simply append "2" to each of the column names so that each column has a unique name after the join.

from pyspark.sql import functions

df2 = df.select([functions.col(c).alias(c + "2") for c in df.columns])

df = df.join(df2, on=df['NAME'] == df2['NAME2'], how='left_outer')

df.show()
+---+----+---+---+-----+----+
| ID|NAME|AGE|ID2|NAME2|AGE2|
+---+----+---+---+-----+----+
|  1|John| 50|  1| John|  50|
|  2|Anna| 32|  2| Anna|  32|
|  3|Josh| 41|  3| Josh|  41|
|  4|Paul| 98|  4| Paul|  98|
+---+----+---+---+-----+----+

If I simply did df.join(df).select("NAME"), PySpark would not know which column I want to select, since both sides have a column with the exact same name. This leads to errors like the following:

AnalysisException: Reference 'NAME' is ambiguous, could be: NAME, NAME.
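
As an aside, DataFrame.alias can also resolve this ambiguity without renaming any columns, because a qualified name picks out one side of the join. A minimal sketch, assuming df is the original four-row DataFrame above:

from pyspark.sql import functions as F

# Alias both sides so each column reference carries a qualifier.
joined = df.alias("a").join(
    df.alias("b"),
    F.col("a.NAME") == F.col("b.NAME"),
    how="left_outer",
)

# "a.NAME" is unambiguous even though both sides have a NAME column.
joined.select("a.NAME").show()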
