Python - Keep rows in a dataframe based on a partial string match

Question

I have 2 dataframes:
df1 is a list of mailboxes and email IDs
df2 is a list of approved domains

I read both dataframes from an Excel sheet:

    xls = pd.ExcelFile(input_file_shared_mailbox)
    df = pd.read_excel(xls, sheet_name = sheet_name_shared_mailbox)

I want to keep only the records in df1 where df1[Email_Id] contains one of the values in df2[approved_domain]:

    print(df1)  
        Mailbox Email_Id  
    0   mailbox1   abc@gmail.com  
    1   mailbox2   def@yahoo.com  
    2   mailbox3   ghi@msn.com  

    print(df2)  
        approved_domain  
    0   msn.com  
    1   gmail.com

I want df3 to show:

    print (df3)  
        Mailbox Email_Id  
    0   mailbox1   abc@gmail.com  
    1   mailbox3   ghi@msn.com

This is the code I have right now. I think it is close, but I can't figure out the exact problem in the syntax:

    df3 = df1[df1['Email_Id'].apply(lambda x: [item for item in x if item in df2['Approved_Domains'].tolist()])]

But I get this error:

    TypeError: unhashable type: 'list'

I spent a lot of time searching the forum for a solution but could not find what I was looking for. I appreciate all the help.

Answer 1

These are the steps to follow for your two data frames:

1. Split the Email_Id column into two separate columns

      df1[['add', 'domain']] = df1['Email_Id'].str.split('@', n=1, expand=True)

2. Then drop the add column to keep the data frame clean

      df1 = df1.drop('add', axis=1)

3. Build a new data frame that keeps only the rows whose 'domain' value appears in the 'approved_domain' column

      df_new = df1[df1['domain'].isin(df2['approved_domain'])]

4. Drop the 'domain' column in df_new

      df_new = df_new.drop('domain', axis=1)

This is what the result will be:

        Mailbox       Email_Id
    0  mailbox1  abc@gmail.com
    2  mailbox3    ghi@msn.com
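
The four steps above can also be collapsed into a single filter that never adds helper columns. This is only a minimal sketch, assuming the column names from the question (`Mailbox`, `Email_Id`, `approved_domain`); the sample frames are rebuilt here just so the block runs on its own, and the lowercasing is an added assumption that domains should compare case-insensitively.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    'Mailbox': ['mailbox1', 'mailbox2', 'mailbox3'],
    'Email_Id': ['abc@gmail.com', 'def@yahoo.com', 'ghi@msn.com'],
})
df2 = pd.DataFrame({'approved_domain': ['msn.com', 'gmail.com']})

# Treat everything after the last '@' as the domain.
domains = df1['Email_Id'].str.rsplit('@', n=1).str[-1].str.lower()

# Keep only the rows whose domain is in the approved list.
approved = df2['approved_domain'].str.lower()
df3 = df1[domains.isin(approved)].reset_index(drop=True)
print(df3)
```

Because the comparison is an exact match on the domain, an address such as `x@notmsn.com` would not be kept, which a plain substring test would allow.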

Answer 2

You can use a dynamically created regular expression to search each address for one of the valid domains and keep only the rows that match.

Here is the code for your reference.

    import re

    import pandas as pd

    mailbox_list = [
        ['mailbox1', 'abc@gmail.com'],
        ['mailbox2', 'def@yahoo.com'],
        ['mailbox3', 'ghi@msn.com'],
    ]

    valid_domains = ['msn.com', 'gmail.com']

    df1 = pd.DataFrame(mailbox_list, columns=['Mailbox', 'EmailID'])
    df2 = pd.DataFrame(valid_domains, columns=['approved_domain'])

    valid_list = []

    # Compare every address against every approved domain.
    for _, row in df1.iterrows():
        for _, record in df2.iterrows():
            # Escape the domain so '.' matches literally, and anchor it at the end.
            pattern = rf"@{re.escape(record['approved_domain'])}$"
            if re.search(pattern, row['EmailID'], re.IGNORECASE):
                valid_list.append([row['Mailbox'], row['EmailID']])
                break  # one match is enough; avoids duplicate rows

    df3 = pd.DataFrame(valid_list, columns=['Mailbox', 'EmailID'])
    print(df3)

The output of this is:

        Mailbox        EmailID
    0  mailbox1  abc@gmail.com
    1  mailbox3    ghi@msn.com
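
The same regular-expression idea can probably be expressed without the nested loops, by joining all domains into one pattern and letting pandas apply it to the whole column. The following is a sketch under the same assumptions as this answer (column `EmailID`, domains held in a plain list); `pattern` and `mask` are names introduced here for illustration.

```python
import re

import pandas as pd

df1 = pd.DataFrame(
    [['mailbox1', 'abc@gmail.com'],
     ['mailbox2', 'def@yahoo.com'],
     ['mailbox3', 'ghi@msn.com']],
    columns=['Mailbox', 'EmailID'],
)
valid_domains = ['msn.com', 'gmail.com']

# Build a single alternation of the escaped domains, anchored after '@'
# and at the end of the address.
pattern = '@(?:' + '|'.join(re.escape(d) for d in valid_domains) + ')$'

# Boolean mask: True where the address ends in an approved domain.
mask = df1['EmailID'].str.contains(pattern, case=False, regex=True)
df3 = df1[mask].reset_index(drop=True)
print(df3)
```

The non-capturing group `(?:...)` is used so that `str.contains` does not warn about unused match groups.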

Answer 3

Solution

    # Note: df1 and df2 are plain dicts of columns here, not pandas DataFrames.
    df1 = {'MailBox': ['mailbox1', 'mailbox2', 'mailbox3'], 'Email_Id': ['abc@gmail.com', 'def@yahoo.com', 'ghi@msn.com']}
    df2 = {'approved_domain': ['msn.com', 'gmail.com']}

    mailboxes, emails = zip(  # unzip the rows back into columns
        *filter(  # keep only rows whose address contains an approved domain
            lambda i: any(  # i = ('mailbox1', 'abc@gmail.com')
                approved_domain in i[1] for approved_domain in df2['approved_domain']
            ),
            zip(df1['MailBox'], df1['Email_Id'])  # zip the columns into rows
        )
    )

    df3 = {
        'MailBox': mailboxes,
        'Email_Id': emails
    }
    print(df3)

Output:

    {'MailBox': ('mailbox1', 'mailbox3'), 'Email_Id': ('abc@gmail.com', 'ghi@msn.com')}

Some notes:

A big chunk of this code is just for reshaping the data structure. The zipping and unzipping is only there to convert the list of columns to a list of rows and back. If you already have a list of rows, you only have to do the filtering part.
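
For the case mentioned in the notes, where the data is already a list of rows, the filtering part alone might look like the sketch below. The names `rows`, `suffixes`, and `kept` are hypothetical, and the check uses `str.endswith` on `'@' + domain` instead of the substring test in the answer, which is a stricter match and an assumption about what "approved" should mean.

```python
# Rows are (mailbox, email) tuples, as in the comment in the code above.
rows = [
    ('mailbox1', 'abc@gmail.com'),
    ('mailbox2', 'def@yahoo.com'),
    ('mailbox3', 'ghi@msn.com'),
]
approved_domains = ['msn.com', 'gmail.com']

# str.endswith accepts a tuple of suffixes, so one call covers every domain.
suffixes = tuple('@' + domain for domain in approved_domains)

# Keep each row whose address ends in '@<approved domain>'.
kept = [row for row in rows if row[1].lower().endswith(suffixes)]
print(kept)
```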
