
Python - keep rows in dataframe based on partial string match

I have two DataFrames:
df1 is a list of mailboxes and email IDs
df2 is a list of approved domains

I read both DataFrames from an Excel sheet:

    xls = pd.ExcelFile(input_file_shared_mailbox)
    df = pd.read_excel(xls, sheet_name = sheet_name_shared_mailbox)

I want to keep only the records in df1 where df1['Email_Id'] contains one of the domains in df2['approved_domain']:

    print(df1)  
        Mailbox Email_Id  
    0   mailbox1   abc@gmail.com  
    1   mailbox2   def@yahoo.com  
    2   mailbox3   ghi@msn.com  

    print(df2)  
        approved_domain  
    0   msn.com  
    1   gmail.com  

and I want df3, which should contain:

    print (df3)  
        Mailbox Email_Id  
    0   mailbox1   abc@gmail.com  
    1   mailbox3   ghi@msn.com  

This is the code I have right now. I think it is close, but I can't figure out exactly what is wrong with the syntax:

    df3 = df1[df1['Email_Id'].apply(lambda x: [item for item in x if item in df2['Approved_Domains'].tolist()])]

But I get this error:

    TypeError: unhashable type: 'list'

I spent a lot of time searching the forum for a solution but could not find what I was looking for. I appreciate any help.

These are the steps to follow to get the result you want from your two DataFrames:

1. Split the Email_Id column into two separate columns

    df1[['add', 'domain']] = df1['Email_Id'].str.split('@', n=1, expand=True)

2. Then drop the 'add' column to keep the DataFrame clean

    df1 = df1.drop('add', axis=1)

3. Build a new DataFrame that keeps only the rows whose 'domain' value appears in the 'approved_domain' column of df2

    df_new = df1[df1['domain'].isin(df2['approved_domain'])]

4. Drop the 'domain' column in df_new

    df_new = df_new.drop('domain', axis=1)

This is the result:

        Mailbox     Email_Id
    0   mailbox1    abc@gmail.com
    2   mailbox3    ghi@msn.com
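
The steps above can also be condensed into a single boolean mask, without adding helper columns to df1. This is a minimal sketch, assuming the column names from the question (Mailbox, Email_Id, approved_domain). Lower-casing both sides is my own assumption, added so that an address like ABC@Gmail.com would still match.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "Mailbox": ["mailbox1", "mailbox2", "mailbox3"],
    "Email_Id": ["abc@gmail.com", "def@yahoo.com", "ghi@msn.com"],
})
df2 = pd.DataFrame({"approved_domain": ["msn.com", "gmail.com"]})

# Take the text after the last "@" of each address.
# Lower-casing is an assumption, not something the question asks for.
domain = df1["Email_Id"].str.rsplit("@", n=1).str[-1].str.lower()

# True for rows whose domain appears in the approved list.
mask = domain.isin(df2["approved_domain"].str.lower())

# Keep the matching rows and renumber the index from 0.
df3 = df1[mask].reset_index(drop=True)
print(df3)
```

Because the comparison is an exact match on the domain part, an address such as someone@notmsn.com would not slip through the way it could with a plain substring test.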

You can use a dynamically created regular expression to search each address for one of the valid domains and keep only the rows that match.

Here is the code for your reference.

    # -*- coding: utf-8 -*-

    import re

    import pandas as pd

    mailbox_list = [
            ['mailbox1', 'abc@gmail.com'],
            ['mailbox2', 'def@yahoo.com'],
            ['mailbox3', 'ghi@msn.com']]

    valid_domains = ['msn.com', 'gmail.com']

    df1 = pd.DataFrame(mailbox_list, columns=['Mailbox', 'EmailID'])
    df2 = pd.DataFrame(valid_domains)  # single column, labelled 0

    valid_list = []

    for index, row in df1.iterrows():
        for idx, record in df2.iterrows():
            # Escape the domain so "." is matched literally, and anchor it
            # at the end of the address.
            if re.search(rf"@{re.escape(record[0])}$", row['EmailID'], re.IGNORECASE):
                valid_list.append([row['Mailbox'], row['EmailID']])
                break  # one matching domain is enough; avoids duplicate rows

    df3 = pd.DataFrame(valid_list, columns=['Mailbox', 'EmailID'])
    print(df3)

The output of this is:

        Mailbox        EmailID
    0  mailbox1  abc@gmail.com
    1  mailbox3    ghi@msn.com
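
The same regular-expression idea can be applied to the whole column at once instead of looping over rows. This is a rough sketch rather than a drop-in replacement: it joins the domains into one pattern and passes it to pandas' `Series.str.contains`. The `pattern` and `mask` names are mine, and treating missing addresses as non-matches (`na=False`) is an assumption.

```python
import re

import pandas as pd

df1 = pd.DataFrame(
    [
        ["mailbox1", "abc@gmail.com"],
        ["mailbox2", "def@yahoo.com"],
        ["mailbox3", "ghi@msn.com"],
    ],
    columns=["Mailbox", "EmailID"],
)
valid_domains = ["msn.com", "gmail.com"]

# Build a single pattern: "@", then any one of the escaped domains,
# then the end of the string.
pattern = "@(?:" + "|".join(re.escape(d) for d in valid_domains) + ")$"

# case=False ignores letter case; na=False treats missing values as no match.
mask = df1["EmailID"].str.contains(pattern, case=False, regex=True, na=False)

df3 = df1[mask].reset_index(drop=True)
print(df3)
```

The non-capturing group `(?:...)` keeps pandas from warning about match groups, and `re.escape` makes sure the dots in the domains are matched literally.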

Solution

    df1 = {'MailBox': ['mailbox1', 'mailbox2', 'mailbox3'], 'Email_Id': ['abc@gmail.com', 'def@yahoo.com', 'ghi@msn.com']}
    df2 = {'approved_domain': ['msn.com', 'gmail.com']}

    mailboxes, emails = zip(  # unzip the rows back into columns
        *filter(  # keep only the rows with an approved domain
            lambda i: any(  # i = ('mailbox1', 'abc@gmail.com')
                approved_domain in i[1] for approved_domain in df2['approved_domain']
            ),
            zip(df1['MailBox'], df1['Email_Id'])  # zip the columns into rows
        )
    )

    df3 = {
        'MailBox': mailboxes,
        'Email_Id': emails
    }
    print(df3)

Output:

    {'MailBox': ('mailbox1', 'mailbox3'), 'Email_Id': ('abc@gmail.com', 'ghi@msn.com')}

Some notes:

A big chunk of this code is just there to reshape the data structure. The zipping and unzipping only converts the list of columns into a list of rows and back. If you already have a list of rows, you only need the filtering part.
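
For the list-of-rows case mentioned in the note, the filtering part on its own might look like the sketch below. The `rows` layout, the `has_approved_domain` helper, and the lower-casing are assumptions of mine for illustration. Unlike the substring test above, it compares the text after the last "@" exactly.

```python
# Each row is a (mailbox, email) tuple; this layout is an assumption.
rows = [
    ("mailbox1", "abc@gmail.com"),
    ("mailbox2", "def@yahoo.com"),
    ("mailbox3", "ghi@msn.com"),
]
approved_domains = {"msn.com", "gmail.com"}


def has_approved_domain(email):
    """Return True if the part after the last '@' is an approved domain."""
    domain = email.rpartition("@")[2].lower()
    return domain in approved_domains


# Keep only the rows whose email address passes the check.
filtered_rows = [row for row in rows if has_approved_domain(row[1])]
print(filtered_rows)
# [('mailbox1', 'abc@gmail.com'), ('mailbox3', 'ghi@msn.com')]
```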


 