Tokenizing a PySpark DataFrame column and storing the tokens in new columns

Question

I have a PySpark DataFrame with a string column that contains a comma-separated list of values (up to 5 values), like this:

+----+----------------------+
|col1|col2                  |
+----+----------------------+
|1   | 'a1, b1, c1'         |
|2   | 'a2, b2'             |
|3   | 'a3, b3, c3, d3, e3' |
+----+----------------------+

I want to tokenize col2 and create 5 separate columns from it, filled with null where the tokenization returns fewer than 5 values:

+----+----+----+----+----+----+
|col1|col3|col4|col5|col6|col7|
+----+----+----+----+----+----+
|1   |'a1'|'b1'|'c1'|null|null|
|2   |'a2'|'b2'|null|null|null|
|3   |'a3'|'b3'|'c3'|'d3'|'e3'|
+----+----+----+----+----+----+

Any help will be much appreciated.

Answer 1

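Split the column into an array, then select each array element as its own column.

from pyspark.sql.functions import col, split

# Split col2 into an array, then expose elements 0-4 as col3..col7 (missing elements show up as null in the output below).
df.withColumn('col2', split('col2', ', ')) \
  .select(col('col1'), *[col('col2')[i].alias('col' + str(i + 3)) for i in range(0, 5)]) \
  .show()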

+----+----+----+----+----+----+
|col1|col3|col4|col5|col6|col7|
+----+----+----+----+----+----+
|   1|  a1|  b1|  c1|null|null|
|   2|  a2|  b2|null|null|null|
|   3|  a3|  b3|  c3|  d3|  e3|
+----+----+----+----+----+----+
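
To make the per-row behavior easier to see, here is a rough plain-Python sketch (not Spark code) of what the split-then-index step does to each value of col2. The helper name `tokenize` and its `width` and `sep` parameters are made up for illustration. One difference to keep in mind: Spark's `split` treats its pattern as a regular expression, while Python's `str.split` matches the separator literally; for `', '` the two behave the same.

```python
from typing import List, Optional


def tokenize(value: Optional[str], width: int = 5, sep: str = ", ") -> List[Optional[str]]:
    """Split value on sep and pad with None up to width (hypothetical helper)."""
    if value is None:
        # A missing input value yields an all-None row.
        return [None] * width
    tokens = value.split(sep)
    # Positions past the end of the token list are filled with None,
    # mirroring the nulls in the answer's output table.
    return [tokens[i] if i < len(tokens) else None for i in range(width)]


# Same sample data as in the question.
rows = [(1, "a1, b1, c1"), (2, "a2, b2"), (3, "a3, b3, c3, d3, e3")]
result = [(col1, *tokenize(col2)) for col1, col2 in rows]

for row in result:
    print(row)
```

This only models the logic. On a real DataFrame the column expressions in the answer are the better fit, since they run inside Spark rather than row by row in Python.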

solution1
1 ACCEPTED 2020-08-31 03:03:02
