简体   繁体   中英

Remove period then email extension after '@' into new column to extract first and last name information

I have a list of emails that are in the format firstname.lastname@email.com. I would like to create a new column with only the first and last name extracted from the email address.

I am using PySpark. This is an example of the desired output:

data = [{"Email": "john.doe@email.com", "Role": "manager"},
{"Email": "jane.doe@email.com", "Role": "vp"}]

df = spark.createDataFrame(data)

type(df)

# original data set
+------------------+-------+
|Email             |Role   |
+------------------+-------+
|john.doe@email.com|manager|
|jane.doe@email.com|vp     |
+------------------|-------+

# what I want the output to look like
+------------------+-------+--------+
|Email             |Role   |Name    |
+------------------+-------+--------+
|john.doe@email.com|manager|john doe|
|jane.doe@email.com|vp     |jane doe|
+------------------|-------|--------+

How can I remove the period, replace it with a space, then drop everything after the @ into a new column to get the names like the example above?

It will replace the . and @... with a space which we'll have to trim from the end.

from pyspark.sql import functions as F

df.withColumn('Name', F.trim(F.regexp_replace('Email', '\.|@.*', ' '))).show()
# +------------------+-------+--------+
# |             Email|   Role|    Name|
# +------------------+-------+--------+
# |john.doe@email.com|manager|john doe|
# |jane.doe@email.com|     vp|jane doe|
# +------------------+-------+--------+

You can use Python's .split method for strings and a loop to add a "Name" field to each record in your list.

for d in data:
    d["Name] = " ".join(d["Email"].split("@")[0].split("."))

In the above loop we split the "Email" field at the "@" character, creating a list of two elements, of which we take the first one, and then split that on the character ".", which gives us the first and last name. Then we join them with a space (" ") in between.

You can use regex_extract and regex_replace .

from pyspark.sql import functions as F
df = df.withColumn('Name', F.regexp_extract(
        F.regexp_replace('Email', '\.', ' '), 
        '(.*)@', 
        1)
     )

First, regexp_replace('Email', '\.', ' ') will replace . to space in Email column.

Then, regexp_extract(..., '(.*)@', 1) will extract the 1st capture group.

Regex explanation

(.*) => .* is any characters with any length. Wrap with () to make a capture group.
@ => match @ mark. 

(.*)@ => 1st Capture group will capture any characters before @.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM