
How to split Spark dataframe rows into columns?

I have a PySpark dataframe and I would like to split its rows into columns based on the unique values of a given column, joining them with the values of the other column. For illustration, let me use the following example, where my original dataframe is df.

df.show()
+-----+-----+
| col1| col2|
+-----+-----+
|   z1|   a1|
|   z1|   b2|
|   z1|   c3|
|   x1|   a1|
|   x1|   b2|
|   x1|   c3|
+-----+-----+

What I would like to do is split on the unique values of col1, thus generating a new column (say, col3) by joining on the values of col2. The resulting dataframe I am after would look like the following:

+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
|   z1|   a1|  x1 |
|   z1|   b2|  x1 |
|   z1|   c3|  x1 |
+-----+-----+-----+

This illustrative example only contains two unique values in col1 (i.e. z1 and x1). Ideally, I would like to write a piece of code that automatically detects the unique values in col1 and generates a new corresponding column. Does anyone know where I can start?

Edit: It is arbitrary that z1 and x1 end up in col1 and col3, respectively. It could just as well be the other way round, since I am simply interested in splitting by unique values.

Many thanks in advance,

Marioanzas

I think you're trying to group by col2 and collect a set of distinct col1 values.

import pyspark.sql.functions as F

# Collect the set of distinct col1 values for each col2 into an array column
df2 = df.groupBy('col2').agg(F.collect_set('col1').alias('col'))

# Find the size of the largest array, then expand the array into that many columns
max_size = df2.select(F.max(F.size('col'))).head()[0]
df3 = df2.select(
    'col2',
    *[F.col('col')[i] for i in range(max_size)]
)

df3.show()
+----+------+------+
|col2|col[0]|col[1]|
+----+------+------+
|  b2|    z1|    x1|
|  a1|    z1|    x1|
|  c3|    z1|    x1|
+----+------+------+

Then you can rearrange/rename the columns as you wish.
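For instance, here is a minimal sketch (assuming df3 has exactly the three columns shown above, and relying on their positional order) that renames them to match the layout asked for in the question. Note that collect_set does not guarantee ordering, so which value lands in col1 versus col3 is arbitrary:

# toDF renames columns positionally: df3 currently holds ['col2', 'col[0]', 'col[1]']
df4 = df3.toDF('col2', 'col1', 'col3').select('col1', 'col2', 'col3')
df4.show()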

It's not obvious to me why z1 should go in col1 and x1 in col3 in your question. It could very well be the other way round - there is no way to tell from your logic.

You can join using two conditions and dropDuplicates:

import pyspark.sql.functions as F

df.alias('left').join(
    df.alias('right'),
    on=[
        F.col('left.col2') == F.col('right.col2'),  # match rows on col2
        F.col('left.col1') != F.col('right.col1')   # but only pair up different col1 values
    ]
).dropDuplicates(['col2']).show()

Output:

+----+----+----+----+
|col1|col2|col1|col2|
+----+----+----+----+
|  x1|  b2|  z1|  b2|
|  x1|  a1|  z1|  a1|
|  x1|  c3|  z1|  c3|
+----+----+----+----+

Then you can drop and rename columns:

df.alias('left').join(
    df.alias('right'),
    on=[
        F.col('left.col2') == F.col('right.col2'),
        F.col('left.col1') != F.col('right.col1')
    ]
).dropDuplicates(['col2'])\
.drop(F.col('right.col2'))\
.select(
    F.col('right.col1'),              # keeps the name col1
    F.col('col2'),
    F.col('left.col1').alias('col3')  # renamed to col3
).show()

Output:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  z1|  b2|  x1|
|  z1|  a1|  x1|
|  z1|  c3|  x1|
+----+----+----+

You can also use SQL:

df.createOrReplaceTempView('df')

spark.sql(
    """
    SELECT
      l.col1 AS col1,
      l.col2 AS col2,
      r.col1 AS col3 
    FROM df AS l
    INNER JOIN df AS r
    ON l.col2 = r.col2
    AND l.col1 <> r.col1
    """
).dropDuplicates(['col2']).show()

Output:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  z1|  b2|  x1|
|  z1|  a1|  x1|
|  z1|  c3|  x1|
+----+----+----+
