简体   繁体   中英

PySpark: Get first character of each word in string

For an assignment I have been asked to shorten the names of clients to only the first letter of each name where they are separated by a space character.

I found a lot of solutions for this in Python, but I am not able to translate this to a dataframe.

The DF looks like this:

| ID       | Name           | 
| -------- | -------------- |
| 1        | John Doe       |
| 2        | Roy Lee Winters|
| 3        | Mary-Kate Baron|

My desired output would be:

| ID | Name | Shortened_name| 
| -------- | -------- | -------------- |
| 1 | John Doe | JD |
| 2 | Roy Lee Winters | RLW |
| 3 | Mary-Kate Baron | MB |

I've had some result with the code below but this is not working when there are more then 2 names. I would also like to have some more 'flexible' code as some people have 4 or 5 names where others only have 1.

df.withColumn("col1", F.substring(F.split(F.col("Name"), " ").getItem(0), 1, 1))\
  .withColumn("col2", F.substring(F.split(F.col("Name"), " ").getItem(1), 1, 1))\
  .withColumn('Shortened_name', F.concat('col1', 'col2'))

You can split the Name column then use transform function on the resulting array to get first letter of each element:

from pyspark.sql import functions as F

df = spark.createDataFrame([(1, "John Doe"), (2, "Roy Lee Winters"), (3, "Mary-Kate Baron")], ["ID", "Name"])

df1 = df.withColumn(
    "Shortened_name",
    F.array_join(F.expr("transform(split(Name, ' '), x -> left(x, 1))"), "")
)

df1.show()
# +---+---------------+--------------+
# | ID|           Name|Shortened_name|
# +---+---------------+--------------+
# |  1|       John Doe|            JD|
# |  2|Roy Lee Winters|           RLW|
# |  3|Mary-Kate Baron|            MB|
# +---+---------------+--------------+

Or by using aggregate function:

df1 = df.withColumn(
    "Shortened_name",
    F.expr("aggregate(split(Name, ' '), '', (acc, x) -> acc || left(x, 1))")
)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM