I have two dataframes:
df1 is a list of mailboxes and email IDs
df2 is a list of approved domains
I read both dataframes from an Excel sheet
xls = pd.ExcelFile(input_file_shared_mailbox)
df = pd.read_excel(xls, sheet_name = sheet_name_shared_mailbox)
I want to keep only the records in df1 where df1['Email_Id'] contains one of the domains in df2['approved_domain']
print(df1)
Mailbox Email_Id
0 mailbox1 abc@gmail.com
1 mailbox2 def@yahoo.com
2 mailbox3 ghi@msn.com
print(df2)
approved_domain
0 msn.com
1 gmail.com
and I want df3, which should show
print (df3)
Mailbox Email_Id
0 mailbox1 abc@gmail.com
1 mailbox3 ghi@msn.com
This is the code I have right now, which I think is close, but I can't figure out the exact problem in the syntax:
df3 = df1[df1['Email_Id'].apply(lambda x: [item for item in x if item in df2['Approved_Domains'].tolist()])]
But I get this error:
TypeError: unhashable type: 'list'
I spent a lot of time searching the forum for a solution but could not find what I was looking for. I appreciate any help.
These are the steps to follow for your two dataframes:
1. Split the Email_Id column into two separate columns (the part before the @ and the domain)
df1[['add', 'domain']] = df1['Email_Id'].str.split('@', n=1, expand=True)
2. Then drop the 'add' column to keep the dataframe clean
df1 = df1.drop('add', axis=1)
3. Build a new dataframe that keeps only the rows whose 'domain' value appears in the 'approved_domain' column of df2
df_new = df1[df1['domain'].isin(df2['approved_domain'])]
4. Drop the 'domain' column from df_new
df_new = df_new.drop('domain', axis=1)
This is what the result will be
Mailbox Email_Id
0 mailbox1 abc@gmail.com
2 mailbox3 ghi@msn.com
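The split-and-filter idea above can also be written without temporary columns. The following is a minimal sketch, assuming the column names from the question (Mailbox, Email_Id, approved_domain) and assuming that domain comparison should ignore case; the sample frames are rebuilt inline so the snippet runs on its own.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "Mailbox": ["mailbox1", "mailbox2", "mailbox3"],
    "Email_Id": ["abc@gmail.com", "def@yahoo.com", "ghi@msn.com"],
})
df2 = pd.DataFrame({"approved_domain": ["msn.com", "gmail.com"]})

# Everything after the last "@" is treated as the domain (assumption: lowercase both sides)
domains = df1["Email_Id"].str.rsplit("@", n=1).str[-1].str.lower()
approved = df2["approved_domain"].str.lower()

# Keep only rows whose domain is in the approved list, then renumber the index
df3 = df1[domains.isin(approved)].reset_index(drop=True)
print(df3)
```

Because the domain is compared for equality rather than searched as a substring, an address such as abc@notgmail.com would not slip through.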
You can use a dynamically built regular expression to search each email address for an approved domain and keep only the rows that match.
Here is the code for your reference.
# -*- coding: utf-8 -*-
import pandas as pd
import re
mailbox_list = [
['mailbox1', 'abc@gmail.com'],
['mailbox2', 'def@yahoo.com'],
['mailbox3', 'ghi@msn.com']]
valid_domains = ['msn.com', 'gmail.com']
df1 = pd.DataFrame(mailbox_list, columns=['Mailbox', 'EmailID'])
df2 = pd.DataFrame(valid_domains)
valid_list = []
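for index, row in df1.iterrows():
    for idx, record in df2.iterrows():
        # df2 has a single column labelled 0; escape the domain so "." matches literally
        if re.search(rf"@{re.escape(record[0])}$", row['EmailID'], re.IGNORECASE):
            valid_list.append([row['Mailbox'], row['EmailID']])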
df3 = pd.DataFrame(valid_list, columns=['Mailbox', 'EmailID'])
print(df3)
The output of this is:
Mailbox EmailID
0 mailbox1 abc@gmail.com
1 mailbox3 ghi@msn.com
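The same regular-expression idea can be applied to the whole column at once instead of looping with iterrows. This is a rough sketch, assuming the EmailID column name used in this answer; the pattern is built by joining the escaped domains into a single alternation.

```python
import re
import pandas as pd

df1 = pd.DataFrame(
    [["mailbox1", "abc@gmail.com"],
     ["mailbox2", "def@yahoo.com"],
     ["mailbox3", "ghi@msn.com"]],
    columns=["Mailbox", "EmailID"],
)
valid_domains = ["msn.com", "gmail.com"]

# Build one pattern such as "@(?:msn\.com|gmail\.com)$"; the non-capturing
# group avoids pandas' warning about match groups in str.contains
pattern = "@(?:" + "|".join(re.escape(d) for d in valid_domains) + ")$"

# Boolean mask: True where the address ends in an approved domain
mask = df1["EmailID"].str.contains(pattern, case=False, regex=True)
df3 = df1[mask].reset_index(drop=True)
print(df3)
```

For large frames this is usually faster than nested iterrows loops, since the matching happens inside pandas rather than in a Python-level loop per row pair.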
df1 = {'MailBox': ['mailbox1', 'mailbox2', 'mailbox3'], 'Email_Id': ['abc@gmail.com', 'def@yahoo.com', 'ghi@msn.com']}
df2 = {'approved_domain':['msn.com', 'gmail.com']}
mailboxes, emails = zip(  # unzip the columns
    *filter(  # filter
        lambda i: any([  # i = ('mailbox1', 'abc@gmail.com')
            approved_domain in i[1] for approved_domain in df2['approved_domain']
        ]),
        zip(df1['MailBox'], df1['Email_Id'])  # zip the columns
    )
)
df3 = {
    'MailBox': mailboxes,
    'Email_Id': emails
}
print(df3)
> {'MailBox': ('mailbox1', 'mailbox3'), 'Email_Id': ('abc@gmail.com', 'ghi@msn.com')}
A big chunk of this code is just for reshaping the data structure. The zipping and unzipping is only there to convert the list of columns to a list of rows and back. If you already have a list of rows, you only need the filtering part.
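For illustration, here is what the filtering part alone might look like when the data is already a list of rows. This is a sketch in plain Python; the names rows, approved, and has_approved_domain are hypothetical, and it assumes the domain should match exactly (ignoring case) rather than as a substring.

```python
# Each row is a (mailbox, email) tuple; sample data from the question
rows = [
    ("mailbox1", "abc@gmail.com"),
    ("mailbox2", "def@yahoo.com"),
    ("mailbox3", "ghi@msn.com"),
]
approved = {"msn.com", "gmail.com"}


def has_approved_domain(row):
    """Return True if the part after the last '@' is an approved domain."""
    _, email = row
    return email.rsplit("@", 1)[-1].lower() in approved


# Keep only the rows that pass the check
kept = [row for row in rows if has_approved_domain(row)]
print(kept)
```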