I have a PySpark dataframe and I would like to split its rows into columns based on the unique values of a given column, joining on the values of the other column. For illustrative purposes, let me use the following example, where my original dataframe is df.
df.show()
+-----+-----+
| col1| col2|
+-----+-----+
|   z1|   a1|
|   z1|   b2|
|   z1|   c3|
|   x1|   a1|
|   x1|   b2|
|   x1|   c3|
+-----+-----+
What I would like to do is to split on the unique values of col1, thus generating a new column (say, col3) by joining on the values of col2. The resulting dataframe that I am after would look like the following:
+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
|   z1|   a1|   x1|
|   z1|   b2|   x1|
|   z1|   c3|   x1|
+-----+-----+-----+
This illustrative example only contains two unique values in col1 (i.e. z1 and x1). Ideally, I would like to write a piece of code which automatically detects the unique values in col1 and generates a corresponding new column. Does anyone know where I can start from?
Edit: It is arbitrary that z1 and x1 end up in col1 and col3, respectively. It could definitely be the other way round, since I am simply interested in splitting by unique values.
Many thanks in advance,
Marioanzas
I think you're trying to group by col2 and collect a set of distinct col1 values.
import pyspark.sql.functions as F

# Collect the distinct col1 values for each col2 into an array column
df2 = df.groupBy('col2').agg(F.collect_set('col1').alias('col'))

# Expand the array into one column per element, up to the largest array size
df3 = df2.select(
    'col2',
    *[F.col('col')[i] for i in range(df2.select(F.max(F.size('col'))).head()[0])]
)
df3.show()
+----+------+------+
|col2|col[0]|col[1]|
+----+------+------+
|  b2|    z1|    x1|
|  a1|    z1|    x1|
|  c3|    z1|    x1|
+----+------+------+
Then you can rearrange/rename the columns as you wish.
It's not obvious to me why z1 should go in col1 and x1 in col3 in your question. It could very well be the other way round - there is no way to tell from your logic.
You can join using two conditions and dropDuplicates:
import pyspark.sql.functions as F

df.alias('left').join(
    df.alias('right'),
    on=[
        F.col('left.col2') == F.col('right.col2'),
        F.col('left.col1') != F.col('right.col1')
    ]
).dropDuplicates(['col2']).show()
Output:
+----+----+----+----+
|col1|col2|col1|col2|
+----+----+----+----+
|  x1|  b2|  z1|  b2|
|  x1|  a1|  z1|  a1|
|  x1|  c3|  z1|  c3|
+----+----+----+----+
Then you can drop and rename the columns:
df.alias('left').join(
    df.alias('right'),
    on=[
        F.col('left.col2') == F.col('right.col2'),
        F.col('left.col1') != F.col('right.col1')
    ]
).dropDuplicates(['col2']) \
    .drop(F.col('right.col2')) \
    .select(
        F.col('right.col1'), F.col('col2'), F.col('left.col1').alias('col3')
    ).show()
Output:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|  z1|  b2|  x1|
|  z1|  a1|  x1|
|  z1|  c3|  x1|
+----+----+----+
You can also use SQL:
df.createOrReplaceTempView('df')
spark.sql(
    """
    SELECT
        l.col1 AS col1,
        l.col2 AS col2,
        r.col1 AS col3
    FROM df AS l
    INNER JOIN df AS r
        ON l.col2 = r.col2
        AND l.col1 <> r.col1
    """
).dropDuplicates(['col2']).show()
Output:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|  z1|  b2|  x1|
|  z1|  a1|  x1|
|  z1|  c3|  x1|
+----+----+----+