Python - keep rows in dataframe based on partial string match

I have 2 dataframes:
df1 is a list of mailboxes and email IDs
df2 is a list of approved domains

I read both dataframes from an Excel sheet:

    xls = pd.ExcelFile(input_file_shared_mailbox)
    df = pd.read_excel(xls, sheet_name = sheet_name_shared_mailbox)

I want to keep only the records in df1 where df1['Email_Id'] contains one of the domains in df2['approved_domain'].

    print(df1)  
        Mailbox Email_Id  
    0   mailbox1   abc@gmail.com  
    1   mailbox2   def@yahoo.com  
    2   mailbox3   ghi@msn.com  

    print(df2)  
        approved_domain  
    0   msn.com  
    1   gmail.com  

and I want df3, which should show

    print (df3)  
        Mailbox Email_Id  
    0   mailbox1   abc@gmail.com  
    1   mailbox3   ghi@msn.com  

This is the code I have right now. I think it is close, but I can't figure out the exact problem:

    df3 = df1[df1['Email_Id'].apply(lambda x: [item for item in x if item in df2['Approved_Domains'].tolist()])]

But I get this error:

    TypeError: unhashable type: 'list'

I spent a lot of time searching the forum for a solution but could not find what I was looking for. I appreciate all the help.

These are the steps you need to follow to get what you want from your two data frames:

1. Split your email_address column into two separate columns

     df1[['add', 'domain']] = df1['email_address'].str.split('@', n=1, expand=True)

2. Then drop the 'add' column to keep your data frame clean

      df1 = df1.drop('add',axis =1)

3. Get a new data frame with only the rows you want by keeping the rows whose 'domain' value appears in the 'approved_domain' column of df2

      df_new = df1[df1['domain'].isin(df2['approved_domain'])]

4. Drop the 'domain' column in df_new

      df_new = df_new.drop('domain',axis = 1)

This is what the result will be

    mailbox     email_address
0   mailbox1    abc@gmail.com
2   mailbox3    ghi@msn.com
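
The steps above can also be condensed so that no helper columns are added and dropped. This is only a minimal sketch under a few assumptions: the sample frames are rebuilt by hand from the question, the column names follow this answer (mailbox, email_address), and lower-casing both sides before comparing is an addition that the original steps do not do.

```python
import pandas as pd

# Sample data mirroring the question (column names follow this answer).
df1 = pd.DataFrame({
    'mailbox': ['mailbox1', 'mailbox2', 'mailbox3'],
    'email_address': ['abc@gmail.com', 'def@yahoo.com', 'ghi@msn.com'],
})
df2 = pd.DataFrame({'approved_domain': ['msn.com', 'gmail.com']})

# Take everything after the last "@" as the domain.
# Lower-casing is an assumption, added so "Gmail.com" would still match.
domain = df1['email_address'].str.rsplit('@', n=1).str[-1].str.lower()
approved = df2['approved_domain'].str.lower()

# isin gives a boolean mask; indexing with it keeps only approved rows.
df_new = df1[domain.isin(approved)]
print(df_new)
```

Because the domain lives in a temporary Series instead of a column of df1, there is nothing to drop afterwards, and the original index labels (0 and 2 here) are preserved.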

You can use a dynamically created regular expression to search each email address for a valid domain from the list and keep only the rows that match.

Here is the code for your reference.

# -*- coding: utf-8 -*-

import pandas as pd
import re

mailbox_list = [
        ['mailbox1', 'abc@gmail.com'],
        ['mailbox2', 'def@yahoo.com'],
        ['mailbox3', 'ghi@msn.com']]

valid_domains = ['msn.com', 'gmail.com']

df1 = pd.DataFrame(mailbox_list, columns=['Mailbox', 'EmailID'])
df2 = pd.DataFrame(valid_domains)

valid_list = []

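# Compare every mailbox row against every approved domain
for index, row in df1.iterrows():
    for idx, record in df2.iterrows():
        # re.escape keeps the "." in the domain literal; "$" anchors the match at the end
        if re.search(rf"@{re.escape(record[0])}$", row['EmailID'], re.IGNORECASE):
            valid_list.append([row['Mailbox'], row['EmailID']])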

df3 = pd.DataFrame(valid_list, columns=['Mailbox', 'EmailID'])
print(df3)

The output of this is:

    Mailbox        EmailID
0  mailbox1  abc@gmail.com
1  mailbox3    ghi@msn.com
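
The same regular-expression idea can be applied without explicit loops by joining all approved domains into one pattern and letting pandas test the whole column at once. The following is a sketch rather than the answer's own method: it swaps the nested iterrows loops for Series.str.contains, and the variable names pattern and mask are made up for illustration.

```python
import re
import pandas as pd

df1 = pd.DataFrame(
    [['mailbox1', 'abc@gmail.com'],
     ['mailbox2', 'def@yahoo.com'],
     ['mailbox3', 'ghi@msn.com']],
    columns=['Mailbox', 'EmailID'])
valid_domains = ['msn.com', 'gmail.com']

# Build one alternation covering every approved domain.
# re.escape keeps each "." literal; (?:...) is a non-capturing group.
pattern = '@(?:' + '|'.join(re.escape(d) for d in valid_domains) + ')$'

# str.contains returns a boolean mask that can index df1 directly.
mask = df1['EmailID'].str.contains(pattern, case=False, regex=True)
df3 = df1[mask].reset_index(drop=True)
print(df3)
```

With a single compiled pattern, each address is scanned once instead of once per domain, which tends to matter when either list grows large.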

Solution

df1 = {'MailBox': ['mailbox1', 'mailbox2', 'mailbox3'], 'Email_Id': ['abc@gmail.com', 'def@yahoo.com', 'ghi@msn.com']}
df2 = {'approved_domain':['msn.com', 'gmail.com']}

mailboxes, emails = zip( # unzip the columns
    *filter( # filter 
        lambda i: any([  # i = ('mailbox1', 'abc@gmail.com')
            approved_domain in i[1] for approved_domain in df2['approved_domain']
        ]),
        zip(df1['MailBox'], df1['Email_Id']) # zip the columns
    )
)

df3 = {
    'MailBox': mailboxes, 
    'Email_Id': emails
}
print(df3)

Output:

> {'MailBox': ('mailbox1', 'mailbox3'), 'Email_Id': ('abc@gmail.com', 'ghi@msn.com')}

Some notes:

A big chunk of this code is just for parsing the data structure. The zipping and unzipping is only there to convert the list of columns to a list of rows and back. If you already have a list of rows, you only need the filtering part.
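
For the case where the data is already a list of rows, the filtering part on its own might look like the sketch below. The names rows, approved_domains and suffixes are hypothetical, and matching on the "@domain" suffix (rather than a plain substring test) is an assumption made here to avoid accepting addresses that merely contain an approved domain somewhere in the middle.

```python
# Plain Python, no pandas: each row is a (mailbox, email) tuple.
rows = [
    ('mailbox1', 'abc@gmail.com'),
    ('mailbox2', 'def@yahoo.com'),
    ('mailbox3', 'ghi@msn.com'),
]
approved_domains = ('msn.com', 'gmail.com')

# str.endswith accepts a tuple of suffixes, so one call checks every domain.
suffixes = tuple('@' + d for d in approved_domains)

# Keep a row only if its email ends with "@<approved domain>".
approved_rows = [row for row in rows if row[1].lower().endswith(suffixes)]
print(approved_rows)
```

This prints [('mailbox1', 'abc@gmail.com'), ('mailbox3', 'ghi@msn.com')].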
