Mock data:
df_3 = [('2', '1'),
        ('3', '2'),
        ('4', '3')]
df_3 = spark.sparkContext.parallelize(df_3).toDF(['id', 'id_parent'])
+---+---------+
| id|id_parent|
+---+---------+
|  2|        1|
|  3|        2|
|  4|        3|
+---+---------+
I want to left-join this DataFrame with itself several times. The first join:
result = df_3.alias("df_left").join(df_3.alias("df_right"), F.col("df_left.id_parent") == F.col("df_right.id"), "left")
+---+---------+----+---------+
| id|id_parent|  id|id_parent|
+---+---------+----+---------+
|  4|        3|   3|        2|
|  2|        1|null|     null|
|  3|        2|   2|        1|
+---+---------+----+---------+
But on the next join the column names will be ambiguous. I don't want to work with suffixes because this will run inside a while loop. I also don't want to drop the first "id_parent", since I later need to do a coalesce.
Summary: I want a large DataFrame to keep left-joining with itself, as in my example, until every id is joined with an id_parent that doesn't exist in id.
The next iteration's result would look like:
+---+---------+----+-----------+----+-----------+
| id|id_parent|id_2|id_parent_2|id_3|id_parent_3|
+---+---------+----+-----------+----+-----------+
|  2|        1|   -|          -|   -|          -|
|  3|        2|   2|          1|   -|          -|
|  4|        3|   3|          2|   2|          1|
+---+---------+----+-----------+----+-----------+
And the final output would be:
+---+---------+--------------------+
| id|id_parent|ultimate_parent_node|
+---+---------+--------------------+
|  2|        1|                   1|
|  3|        2|                   1|
|  4|        3|                   1|
+---+---------+--------------------+
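For a small sample, the expected ultimate_parent_node values can be cross-checked with a plain-Python reference. This is only a sketch and not Spark code: ultimate_parents is a hypothetical helper name, and it assumes every id has exactly one id_parent and that the hierarchy contains no cycles (it raises if it finds one).

```python
def ultimate_parents(edges):
    """Map each id to the first ancestor that never appears in the id column."""
    parent_of = dict(edges)  # id -> id_parent (assumes one parent per id)
    result = {}
    for node in parent_of:
        current = parent_of[node]
        seen = {node}
        # Follow parent links until the parent is not itself an id.
        while current in parent_of:
            if current in seen:
                raise ValueError(f"cycle detected at id {current!r}")
            seen.add(current)
            current = parent_of[current]
        result[node] = current
    return result


edges = [("2", "1"), ("3", "2"), ("4", "3")]
print(ultimate_parents(edges))
```

On the mock data every id resolves to '1', which matches the table above.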
You probably have your own while condition. Just for the example I used a simple i < 4. Inside the loop the columns are renamed and the duplicate join column is dropped. After the loop, a coalesce over the columns in reverse order gives you the ultimate_parent_node.
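from pyspark.sql import functions as F

i = 1
# Suffix every column with _1 so later joins never produce duplicate names.
df_3 = df_3.toDF(*[f"{c}_{i}" for c in df_3.columns])
df_4 = df_3
while i < 4:
    i += 1
    df_4 = (
        df_4
        # Rename the right side to id_{i}, id_parent_{i} and match it to the previous parent.
        .join(df_3.toDF(*[f"{c[:-1]}{i}" for c in df_3.columns]), F.col(f"id_parent_{i-1}") == F.col(f"id_{i}"), 'left')
        # id_{i} only repeats id_parent_{i-1}, so drop it.
        .drop(f'id_{i}')
    )
# Coalesce from the deepest level back to id_1: the first non-null value is the root.
df_4 = df_4.select(
    F.col("id_1").alias("id"),
    F.col("id_parent_1").alias("id_parent"),
    F.coalesce(*df_4.columns[::-1]).alias("ultimate_parent_node")
)
df_4.show()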
# +---+---------+--------------------+
# | id|id_parent|ultimate_parent_node|
# +---+---------+--------------------+
# |  2|        1|                   1|
# |  4|        3|                   1|
# |  3|        2|                   1|
# +---+---------+--------------------+
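The fixed i < 4 bound could be replaced by a data-driven one: stop as soon as a join round matches no rows. The sketch below models that loop in plain Python, with lists and dicts standing in for DataFrames, purely to show the termination logic. resolve_levels and its max_rounds guard are hypothetical names, not part of the code above or of the Spark API.

```python
def resolve_levels(edges, max_rounds=100):
    """Model the repeated left join: each round appends one more parent level."""
    parent_of = dict(edges)  # plays the role of the right-hand DataFrame
    # Each row starts as [id, id_parent], like the columns id_1 and id_parent_1.
    rows = [[node, parent] for node, parent in edges]
    rounds = 0
    while rounds < max_rounds:  # guard against cycles in the data
        # Left join on the newest parent level; None models a null.
        new_level = [
            parent_of.get(row[-1]) if row[-1] is not None else None
            for row in rows
        ]
        if all(value is None for value in new_level):
            break  # no row matched, so every chain has reached its root
        for row, value in zip(rows, new_level):
            row.append(value)
        rounds += 1
    # Coalesce from the deepest level back to the first one.
    result = []
    for row in rows:
        ultimate = next(v for v in reversed(row) if v is not None)
        result.append((row[0], row[1], ultimate))
    return result, rounds


table, rounds = resolve_levels([("2", "1"), ("3", "2"), ("4", "3")])
print(rounds)
for line in table:
    print(line)
```

On the mock data the loop stops after two productive rounds. In Spark, an equivalent check would need an action in every round (for example, counting the non-null values in the newest column), so it trades extra jobs for not having to guess the depth up front.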