I have a PySpark DataFrame with a string column that contains a comma-separated list of values (up to 5 values), like this:
+----+--------------------+
|col1|col2                |
+----+--------------------+
|1   |'a1, b1, c1'        |
|2   |'a2, b2'            |
|3   |'a3, b3, c3, d3, e3'|
+----+--------------------+
I want to tokenize col2 and create 5 different columns out of it, with null values where the tokenization returns fewer than 5 values:
+----+----+----+----+----+----+
|col1|col3|col4|col5|col6|col7|
+----+----+----+----+----+----+
|1   |'a1'|'b1'|'c1'|null|null|
|2   |'a2'|'b2'|null|null|null|
|3   |'a3'|'b3'|'c3'|'d3'|'e3'|
+----+----+----+----+----+----+
Any help will be much appreciated.
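Just split that column into an array and select its elements by index. Indexes past the end of the array come back as null here, which gives you the padding you want.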
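from pyspark.sql.functions import col, split

# Turn col2 into an array, then pick elements 0..4 and name them col3..col7.
df.withColumn('col2', split('col2', ', ')) \
  .select(col('col1'), *[col('col2')[i].alias('col' + str(i + 3)) for i in range(0, 5)]) \
  .show()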
+----+----+----+----+----+----+
|col1|col3|col4|col5|col6|col7|
+----+----+----+----+----+----+
|   1|  a1|  b1|  c1|null|null|
|   2|  a2|  b2|null|null|null|
|   3|  a3|  b3|  c3|  d3|  e3|
+----+----+----+----+----+----+
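If it helps to see what the split-and-index expression does to each row, here is a rough plain-Python sketch of the same logic, with no Spark involved. The helper name split_and_pad and its parameters are made up for illustration, and it assumes the separator is always exactly a comma followed by one space, as in the sample data.

```python
from typing import List, Optional


def split_and_pad(value: str, width: int = 5, sep: str = ", ") -> List[Optional[str]]:
    """Hypothetical helper: split `value` on `sep`, keep at most `width`
    tokens, and pad the result with None so it always has `width` entries."""
    tokens = value.split(sep)[:width]
    return tokens + [None] * (width - len(tokens))


# Same sample rows as in the question: (col1, col2)
rows = [(1, "a1, b1, c1"), (2, "a2, b2"), (3, "a3, b3, c3, d3, e3")]

# Each output row is col1 followed by the five padded tokens (col3..col7).
result = [(col1, *split_and_pad(col2)) for col1, col2 in rows]

for row in result:
    print(row)
```

The None entries play the role of the nulls in the Spark output. Any tokens beyond the fifth are dropped, just as the Spark version only selects indexes 0 to 4.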