
How do you create new columns from each character in a string with Spark/PySpark?

I am trying to take a column in Spark (using PySpark) that has string values like 'A1', 'C2', and 'B9' and create a new column for each character in the string. How can I extract values from the strings to create the new columns?

How do I turn this:

| id | col_s |
|----|-------|
| 1  | 'A1'  |
| 2  | 'C2'  |

into this:

| id | col_s | col_1 | col_2 |
|----|-------|-------|-------|
| 1  | 'A1'  | 'A'   |  '1'  |
| 2  | 'C2'  | 'C'   |  '2'  |

I have been looking through the docs unsuccessfully.

I was able to answer my own question 5 minutes after posting it here...

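from pyspark.sql import functions as F

# Splitting on an empty pattern yields an array of single characters.
split_col = F.split(df['col_s'], '')
df = df.withColumn('col_1', split_col.getItem(0))  # first character, e.g. 'A'
df = df.withColumn('col_2', split_col.getItem(1))  # second character, e.g. '1'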

You can use expr together with the SQL substring function (substr is an alias) to extract the substrings you want. In substring(), the first argument is the column, the second is the position at which to start extracting, and the third is the length of the substring to extract. Note: positions are 1-based, not 0-based.

from pyspark.sql.functions import expr

# substring(str, pos, len): pos is 1-based
df = df.withColumn('col_1', expr('substring(col_s, 1, 1)'))  # first character
df = df.withColumn('col_2', expr('substring(col_s, 2, 1)'))  # second character
df.show()
+---+-----+-----+-----+
| id|col_s|col_1|col_2|
+---+-----+-----+-----+
|  1|   A1|    A|    1|
|  2|   C1|    C|    1|
|  3|   G8|    G|    8|
|  4|   Z6|    Z|    6|
+---+-----+-----+-----+


 