Remove period then email extension after '@' into new column to extract first and last name information
I have a list of emails in the format firstname.lastname@email.com. I would like to create a new column containing only the first and last name extracted from the email address.
I am using PySpark. This is an example of the desired output:
data = [{"Email": "john.doe@email.com", "Role": "manager"},
{"Email": "jane.doe@email.com", "Role": "vp"}]
df = spark.createDataFrame(data)
type(df)
# original data set
+------------------+-------+
|Email             |Role   |
+------------------+-------+
|john.doe@email.com|manager|
|jane.doe@email.com|vp     |
+------------------+-------+
# what I want the output to look like
+------------------+-------+--------+
|Email             |Role   |Name    |
+------------------+-------+--------+
|john.doe@email.com|manager|john doe|
|jane.doe@email.com|vp     |jane doe|
+------------------+-------+--------+
How can I replace the period with a space and drop everything from the @ onward, putting the result in a new column to get the names as in the example above?
It will replace the . and the @... part with a space, which we'll have to trim from the end.
from pyspark.sql import functions as F
df.withColumn('Name', F.trim(F.regexp_replace('Email', r'\.|@.*', ' '))).show()
# +------------------+-------+--------+
# |             Email|   Role|    Name|
# +------------------+-------+--------+
# |john.doe@email.com|manager|john doe|
# |jane.doe@email.com|     vp|jane doe|
# +------------------+-------+--------+
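To see why the trim is needed, the same pattern can be tried outside Spark. This is a minimal sketch that uses Python's re module instead of Spark's regex engine, on the assumption that the two agree for a pattern this simple; the variable names are only for illustration.

```python
import re

email = "john.doe@email.com"

# Replace every "." and the whole "@..." suffix with a single space each.
replaced = re.sub(r"\.|@.*", " ", email)

# The "@..." suffix leaves one trailing space behind, hence the trim.
name = replaced.strip()

print(repr(replaced))  # 'john doe '
print(repr(name))      # 'john doe'
```

The dots in the domain never become spaces here, because the `@.*` alternative consumes the whole suffix in one match.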
You can use Python's .split method for strings and a loop to add a "Name" field to each record in your list.
for d in data:
    d["Name"] = " ".join(d["Email"].split("@")[0].split("."))
In the above loop we split the "Email" field at the "@" character, creating a list of two elements, of which we take the first one, and then split that on the "." character, which gives us the first and last name. Then we join them with a space (" ") in between.
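The same steps can be wrapped in a small helper so they are easy to check on their own. This is only a sketch: name_from_email is a hypothetical name, not part of the answer above, and it assumes every address has the firstname.lastname@domain shape.

```python
def name_from_email(email: str) -> str:
    """Return the part before "@" with its dots turned into spaces."""
    local_part = email.split("@")[0]        # e.g. "john.doe"
    return " ".join(local_part.split("."))  # e.g. "john doe"


data = [{"Email": "john.doe@email.com", "Role": "manager"},
        {"Email": "jane.doe@email.com", "Role": "vp"}]

# Add the "Name" field to each record of the plain Python list.
for d in data:
    d["Name"] = name_from_email(d["Email"])

print([d["Name"] for d in data])  # ['john doe', 'jane doe']
```

Because this changes the Python list and not an existing DataFrame, it would need to run before spark.createDataFrame(data) for the new field to appear as a column.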
You can use regexp_extract and regexp_replace.
from pyspark.sql import functions as F
df = df.withColumn('Name', F.regexp_extract(
    F.regexp_replace('Email', r'\.', ' '),  # replace every "." with a space
    '(.*)@',                                # capture everything before the "@"
    1)                                      # return capture group 1
)
First, the inner regexp_replace call replaces every . in the Email column with a space.
Then, the outer regexp_extract call, with the pattern (.*)@ and group index 1, extracts the 1st capture group.
Regex explanation
(.*) => .* matches any sequence of characters, of any length. Wrapping it in () makes it a capture group.
@ => matches the @ character.
(.*)@ => the 1st capture group captures everything before the @.
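The two steps can be traced by hand with Python's re module. This is an illustrative sketch, not Spark code: it assumes Python's regex behaves like Spark's for these simple patterns, and it mimics regexp_extract returning an empty string when the pattern does not match.

```python
import re

email = "john.doe@email.com"

# Step 1: every "." becomes a space (including the one in the domain).
spaced = re.sub(r"\.", " ", email)

# Step 2: capture group 1 is everything before the "@".
match = re.search(r"(.*)@", spaced)
name = match.group(1) if match else ""

print(spaced)  # john doe@email com
print(name)    # john doe
```

The domain also has its dot replaced in step 1, but that part is discarded in step 2, so no trimming is needed with this approach.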