How do you create new columns from each element in a string with spark/pyspark

Question

I am trying to take a column in Spark (using pyspark) that has string values like 'A1', 'C2', and 'B9' and create new columns with each element in the string. How can I extract values from strings to create a new column?

How do I turn this:

| id | col_s |
|----|-------|
| 1  | 'A1'  |
| 2  | 'C2'  |

into this:

| id | col_s | col_1 | col_2 |
|----|-------|-------|-------|
| 1  | 'A1'  | 'A'   |  '1'  |
| 2  | 'C2'  | 'C'   |  '2'  |

I have been looking through the docs unsuccessfully.

Answer 1

I was able to answer my own question 5 minutes after posting it here...

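import pyspark.sql.functions

# Splitting on an empty pattern yields an array of single characters
split_col = pyspark.sql.functions.split(df['COL_NAME'], "")
df = df.withColumn('COL_NAME_CHAR', split_col.getItem(0))  # first character
df = df.withColumn('COL_NAME_NUM', split_col.getItem(1))   # second character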

Answer 2

You can use expr and substring (also available as substr in Spark SQL) to extract the substrings you want. In substring(), the first argument is the column, the second is the position at which extraction starts, and the third is the length of the substring to extract. Note: indexing is 1-based, not 0-based.
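To make the 1-based indexing concrete, here is a plain-Python model of how the three arguments map onto an ordinary 0-based slice. The helper name `substring_1_based` is hypothetical and not part of PySpark. The sketch only covers start positions of 1 or greater, and does not try to reproduce how Spark treats zero or negative positions.

```python
def substring_1_based(s: str, pos: int, length: int) -> str:
    """Plain-Python model of substring(s, pos, length) for pos >= 1 only.

    Hypothetical helper for illustration; not a PySpark function.
    """
    if pos < 1 or length < 0:
        raise ValueError("this sketch only models pos >= 1 and length >= 0")
    start = pos - 1  # convert the 1-based position to a 0-based slice start
    return s[start:start + length]


first = substring_1_based("A1", 1, 1)   # character at position 1
second = substring_1_based("A1", 2, 1)  # character at position 2
```

So a start position of 1 selects the first character, where Python slicing would use index 0.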

from pyspark.sql.functions import expr

# substring(column, start, length): start is 1-based
df = df.withColumn('col_1', expr('substring(col_s, 1, 1)'))  # first character
df = df.withColumn('col_2', expr('substring(col_s, 2, 1)'))  # second character
df.show()
+---+-----+-----+-----+
| id|col_s|col_1|col_2|
+---+-----+-----+-----+
|  1|   A1|    A|    1|
|  2|   C1|    C|    1|
|  3|   G8|    G|    8|
|  4|   Z6|    Z|    6|
+---+-----+-----+-----+

2 answers

solution1
0 2019-02-03 05:25:54

solution2
0 ACCEPTED 2019-02-03 09:48:52
