I have a dataframe with a column email_adress_raw containing multiple email addresses in each row, and I want to create a new column with the first email address that has a specific email ending from a long list:
email_endings = ['email_end1.com','email_end2.com','email_end3.com',...]
I created the following function, which already works, but since the list is quite long and constantly changing, I would like to iterate over the list inside the code or do something similar. I already thought of a loop, but I can't get it to work.
def email_address_new(s):
    try:
        r = re.search("([\w.-]+@"+email_endings[0]+"|[\w.-]+@"+email_endings[1]+"|[\w.-]+@"+email_endings[2]+")", s).group()
    except AttributeError:
        print(s)
        return None
    except TypeError:
        print(s)
        return None
    return r

udf_email_address_new = F.udf(email_address_new, StringType())
df = df.withColumn("email", udf_email_address_new(F.col("email_adress_raw")))
You can use join to combine the email endings in the list into a single regex pattern:
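import re
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

email_endings = ['email_end1.com','email_end2.com','email_end3.com']

def email_address_new(s):
    try:
        # One "[\w.-]+@<ending>" alternative per entry in the list
        pattern = r"([\w.-]+@" + r"|[\w.-]+@".join(email_endings) + ")"
        r = re.search(pattern, s).group()
    except AttributeError:
        # No match: re.search returned None
        print(s)
        return None
    except TypeError:
        # s is not a string (e.g. a null value in the column)
        print(s)
        return None
    return r

udf_email_address_new = F.udf(email_address_new, StringType())
df2 = df.withColumn("email", udf_email_address_new(F.col("email_adress_raw")))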
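The join-based pattern can be checked in plain Python before wiring it into a UDF. The following is only a minimal sketch: the helper names build_pattern and first_matching_email and the sample string are made up for illustration and are not part of the original code.

```python
import re

email_endings = ['email_end1.com', 'email_end2.com', 'email_end3.com']

def build_pattern(endings):
    # Same construction as in the answer: one "[\w.-]+@<ending>"
    # alternative per ending, all wrapped in a single group.
    return r"([\w.-]+@" + r"|[\w.-]+@".join(endings) + ")"

def first_matching_email(s, endings):
    # Return the first address whose ending is in the list, or None
    # when the input is not a string or nothing matches.
    if not isinstance(s, str):
        return None
    m = re.search(build_pattern(endings), s)
    return m.group() if m else None

# Hypothetical row value with several addresses
sample = "a@other.org; bob@email_end2.com; carol@email_end1.com"
print(first_matching_email(sample, email_endings))
```

Because re.search scans the string from left to right, the result is the first address in the row that has any listed ending, regardless of the order of the endings in the list.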
But you probably don't need a UDF for this purpose. You can just use regexp_extract and replace the empty strings with null if there is no match (regexp_extract returns an empty string if it cannot match):
import pyspark.sql.functions as F

email_endings = ['email_end1.com','email_end2.com','email_end3.com']
pattern = r"([\w.-]+@" + r"|[\w.-]+@".join(email_endings) + ")"

# F.when without otherwise() yields null when the condition is false
df2 = df.withColumn(
    "email",
    F.when(
        F.regexp_extract(F.col("email_adress_raw"), pattern, 1) != "",
        F.regexp_extract(F.col("email_adress_raw"), pattern, 1)
    )
)
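One thing to watch with a joined pattern: the dots in the endings are regex metacharacters, so "email_end1.com" also matches text such as "email_end1Xcom". A possible variation, sketched here in plain Python, escapes each ending with re.escape and factors the shared local-part into one non-capturing alternation. The names loose and strict and the test string are illustrative only, and it is an assumption (not verified here) that the escaped pattern is equally valid for Spark's Java regex engine with simple domain strings like these.

```python
import re

email_endings = ['email_end1.com', 'email_end2.com', 'email_end3.com']

# Unescaped endings: "." matches any single character.
loose = r"([\w.-]+@" + r"|[\w.-]+@".join(email_endings) + ")"

# Escaped endings: each ending is matched literally. The outer group
# keeps the whole address in group 1, as in the patterns above.
strict = r"([\w.-]+@(?:" + "|".join(re.escape(e) for e in email_endings) + "))"

s = "x@email_end1Xcom"  # hypothetical near-miss, not a listed ending
print(bool(re.search(loose, s)))   # the loose pattern accepts it
print(bool(re.search(strict, s)))  # the strict pattern rejects it
```

Since the whole address still sits in group 1, the strict pattern could be passed to regexp_extract with the same group index.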