
Ambiguous column names in a series of self joins

Mock data:

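from pyspark.sql import functions as F  # used as F.col(...) below

# `spark` is an existing SparkSession
df_3 = [('2', '1'),
        ('3', '2'),
        ('4', '3')]

df_3 = spark.sparkContext.parallelize(df_3).toDF(['id', 'id_parent'])
df_3.show()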
+---+---------+
| id|id_parent|
+---+---------+
|  2|        1|
|  3|        2|
|  4|        3|
+---+---------+

I want to left-join this dataframe on itself several times, first time:

result = df_3.alias("df_left").join(df_3.alias("df_right"), F.col("df_left.id_parent") ==  F.col("df_right.id"), "left")
+---+---------+----+---------+
| id|id_parent|  id|id_parent|
+---+---------+----+---------+
|  4|        3|   3|        2|
|  2|        1|null|     null|
|  3|        2|   2|        1|
+---+---------+----+---------+

But on the next iteration the column names will be ambiguous. I don't want to work with hand-written suffixes because this will run inside a while loop. I also don't want to drop the first "id_parent", since I need it later for a coalesce.

Summary: I want a large dataframe to keep left-joining on itself, as in my example, until every id is joined with an id_parent that doesn't exist in id.

The next iteration's result would look like:

+---+---------+----+-----------+----+-----------+
| id|id_parent|id_2|id_parent_2|id_3|id_parent_3|
+---+---------+----+-----------+----+-----------+
|  2|        1|   -|          -|   -|          -|
|  3|        2|   2|          1|   -|          -|
|  4|        3|   3|          2|   2|          1|
+---+---------+----+-----------+----+-----------+

And the final output would be:

+---+---------+--------------------+
| id|id_parent|ultimate_parent_node|
+---+---------+--------------------+
|  2|        1|                   1|
|  3|        2|                   1|
|  4|        3|                   1|
+---+---------+--------------------+
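
As a cross-check of that expected output, the same "follow the parent until it is no longer an id" rule can be sketched in plain Python. This is only a small reference implementation for tiny inputs, not part of the Spark solution; the function name ultimate_parents and the cycle check are assumptions added for illustration.

```python
def ultimate_parents(edges):
    """Map every id to the first ancestor that does not itself appear as an id.

    `edges` is an iterable of (id, id_parent) pairs. Hypothetical helper,
    meant only for checking small examples by hand.
    """
    parent_of = dict(edges)
    result = {}
    for node in parent_of:
        seen = {node}
        current = parent_of[node]
        # Keep climbing while the current parent is itself a known id.
        while current in parent_of:
            if current in seen:
                raise ValueError(f"cycle detected at id {current!r}")
            seen.add(current)
            current = parent_of[current]
        result[node] = current
    return result


edges = [('2', '1'), ('3', '2'), ('4', '3')]
print(ultimate_parents(edges))  # {'2': '1', '3': '1', '4': '1'}
```

Every id resolves to '1', matching the ultimate_parent_node column above.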

You probably have your own while condition; for this example I just used a simple i < 4. Inside the loop, the columns of the right-hand side are renamed with the iteration number, and the redundant id_{i} column is dropped after each join. After the loop, a coalesce over the columns in reverse order gives you the "ultimate_parent_node".

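from pyspark.sql import functions as F

i = 1
# Suffix every column with the iteration number: id_1, id_parent_1
df_3 = df_3.toDF(*[f"{c}_{i}" for c in df_3.columns])
df_4 = df_3
while i < 4:
    i += 1
    # Rename the right side to id_{i}, id_parent_{i} so that no column name repeats.
    # c[:-1] strips the one-character suffix "1" and keeps the trailing underscore.
    df_4 = (
        df_4
        .join(df_3.toDF(*[f"{c[:-1]}{i}" for c in df_3.columns]), F.col(f"id_parent_{i-1}") == F.col(f"id_{i}"), 'left')
        .drop(f'id_{i}')  # duplicates id_parent_{i-1} wherever the join matched
    )
# Columns are now id_1, id_parent_1, id_parent_2, ..., so coalescing them
# in reverse order picks the deepest non-null ancestor for each row.
df_4 = df_4.select(
    F.col("id_1").alias("id"),
    F.col("id_parent_1").alias("id_parent"),
    F.coalesce(*df_4.columns[::-1]).alias("ultimate_parent_node")
)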

df_4.show()
# +---+---------+--------------------+
# | id|id_parent|ultimate_parent_node|
# +---+---------+--------------------+
# |  2|        1|                   1|
# |  4|        3|                   1|
# |  3|        2|                   1|
# +---+---------+--------------------+
