Python RegEx with List as search variables
I have a dataframe with a column email_adress_raw containing multiple email addresses in each row, and I want to create a new column with the first email address that has a specific email ending from a long list.
email_endings = ['email_end1.com','email_end2.com','email_end3.com',...]
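I created the following function, which already works, but since the list is long and still growing, I would like to iterate over the list in the code, or something similar. I have thought about a loop, but somehow I can't get it to work...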
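import re
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def email_address_new(s):
    try:
        # Each list entry is spliced into the pattern by hand
        r = re.search(r"([\w.-]+@" + email_endings[0] + r"|[\w.-]+@" + email_endings[1] + r"|[\w.-]+@" + email_endings[2] + ")", s).group()
    except AttributeError:
        # re.search returned None (no match)
        print(s)
        return None
    except TypeError:
        # s is None
        print(s)
        return None
    return r

udf_email_address_new = F.udf(email_address_new, StringType())
df = df.withColumn("email", udf_email_address_new(F.col("email_adress_raw")))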
You can use join to combine the email endings in the list into a single regular-expression pattern:
import re
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

email_endings = ['email_end1.com','email_end2.com','email_end3.com']

def email_address_new(s):
    try:
        # Yields "([\w.-]+@email_end1.com|[\w.-]+@email_end2.com|...)"
        pattern = r"([\w.-]+@" + r"|[\w.-]+@".join(email_endings) + ")"
        r = re.search(pattern, s).group()
    except (AttributeError, TypeError):
        # AttributeError: no match; TypeError: s is None
        print(s)
        return None
    return r

udf_email_address_new = F.udf(email_address_new, StringType())
df2 = df.withColumn("email", udf_email_address_new(F.col("email_adress_raw")))
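One caveat with joining the endings verbatim is that the "." in each ending is a regex wildcard, so 'email_end1.com' would also match 'email_end1Xcom'. A minimal sketch of a stricter variant follows, assuming the endings are meant as literal strings; the name first_matching_email is made up for illustration and is not part of the original answer.

```python
import re

email_endings = ['email_end1.com', 'email_end2.com', 'email_end3.com']

# re.escape makes "." match only a literal dot; the endings share one
# local-part prefix through a non-capturing group.
pattern = re.compile(
    r"[\w.-]+@(?:" + "|".join(map(re.escape, email_endings)) + r")"
)

def first_matching_email(s):
    """Return the first address in s that ends in a listed domain, else None."""
    if s is None:
        return None
    m = pattern.search(s)
    return m.group() if m else None
```

Compiling the pattern once outside the function avoids rebuilding it for every row, and the explicit None checks replace the try/except. The function can be wrapped with F.udf in the same way as email_address_new.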
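However, you probably don't need a UDF for this. You can simply use regexp_extract and replace the empty string with null when nothing matches (regexp_extract returns an empty string if there is no match):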
import pyspark.sql.functions as F

email_endings = ['email_end1.com','email_end2.com','email_end3.com']
pattern = r"([\w.-]+@" + r"|[\w.-]+@".join(email_endings) + ")"

df2 = df.withColumn(
    "email",
    # when() without otherwise() yields null where the condition is false
    F.when(
        F.regexp_extract(F.col("email_adress_raw"), pattern, 1) != "",
        F.regexp_extract(F.col("email_adress_raw"), pattern, 1)
    )
)
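Because the list of endings keeps growing, it may help to build the pattern string in one place. The sketch below uses a hypothetical helper, build_pattern, that also rejects addresses whose domain merely starts with a listed ending (for example 'a@email_end1.community' or 'a@email_end1.com.au'). It assumes Spark's Java regex engine treats the lookahead and escape syntax used here the same way Python's re does, which is worth verifying on your Spark version.

```python
import re

def build_pattern(endings):
    """Build a one-group pattern for the given literal email endings.

    Hypothetical helper, not part of the original answer.
    """
    alternatives = "|".join(re.escape(e) for e in endings)
    # Group 1 is the full address. The lookaheads reject a match that is
    # followed by more domain characters, or by ".<word char>".
    return r"([\w.-]+@(?:" + alternatives + r"))(?![\w-])(?!\.\w)"

email_endings = ['email_end1.com', 'email_end2.com', 'email_end3.com']
pattern = build_pattern(email_endings)

# Local sanity check with Python's re before handing the string to Spark
m = re.search(pattern, "a@email_end1.community b@email_end2.com")
print(m.group(1) if m else None)
```

The resulting string can be passed as the pattern argument of F.regexp_extract(F.col("email_adress_raw"), pattern, 1), exactly as in the snippet above.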