Python - keep rows in dataframe based on partial string match
I have 2 dataframes:
df1 is a list of mailboxes and email ids
df2 is a list of approved domains
I read both dataframes from an Excel sheet:
xls = pd.ExcelFile(input_file_shared_mailbox)
df = pd.read_excel(xls, sheet_name = sheet_name_shared_mailbox)
I want to keep only the records in df1 where df1['Email_Id'] contains one of the values in df2['approved_domain'].
print(df1)
    Mailbox       Email_Id
0  mailbox1  abc@gmail.com
1  mailbox2  def@yahoo.com
2  mailbox3    ghi@msn.com
print(df2)
  approved_domain
0         msn.com
1       gmail.com
and I want df3, which basically shows
print(df3)
    Mailbox       Email_Id
0  mailbox1  abc@gmail.com
1  mailbox3    ghi@msn.com
This is the code I have right now. I think it is close, but I can't figure out the exact problem in the syntax:
df3 = df1[df1['Email_Id'].apply(lambda x: [item for item in x if item in df2['Approved_Domains'].tolist()])]
But I get this error:
TypeError: unhashable type: 'list'
I spent a lot of time searching the forum for a solution but could not find what I was looking for. I appreciate all the help.
These are the steps you need to follow to do this for your two data frames:
1. Split your email_address column into two separate columns
df1[['add', 'domain']] = df1['email_address'].str.split('@', n=1, expand=True)
2. Then drop the 'add' column to keep your data frame clean
df1 = df1.drop('add',axis =1)
3. Get a new data frame with only the rows you want by keeping the rows whose 'domain' value appears in the 'approved_domain' column
df_new = df1[df1['domain'].isin(df2['approved_domain'])]
4. Drop the 'domain' column in df_new
df_new = df_new.drop('domain', axis=1)
This is what the result will be (the original index labels are kept):
    mailbox  email_address
0  mailbox1  abc@gmail.com
2  mailbox3    ghi@msn.com
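The steps above can be sketched end to end as follows. This is a minimal sketch, not the answerer's exact code: it assumes the column names from the question (Mailbox, Email_Id, approved_domain), builds the frames inline instead of reading them from Excel, and uses an illustrative helper Series named domain rather than a temporary column.

```python
import pandas as pd

# Sample data mirroring the question (built inline instead of read from Excel)
df1 = pd.DataFrame({
    'Mailbox': ['mailbox1', 'mailbox2', 'mailbox3'],
    'Email_Id': ['abc@gmail.com', 'def@yahoo.com', 'ghi@msn.com'],
})
df2 = pd.DataFrame({'approved_domain': ['msn.com', 'gmail.com']})

# Take everything after the last '@' as the domain (illustrative helper Series)
domain = df1['Email_Id'].str.rsplit('@', n=1).str[-1].str.lower()

# Keep only rows whose domain is in the approved list, then renumber the index
approved = df2['approved_domain'].str.lower()
df3 = df1[domain.isin(approved)].reset_index(drop=True)
print(df3)
```

Because the domain is kept in a separate Series, df1 is never modified and there are no helper columns to drop afterwards.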
You can use a dynamically created regular expression to search each email address for one of the valid domains and keep only the rows that match.
Here is the code for reference.
# -*- coding: utf-8 -*-
import pandas as pd
import re

mailbox_list = [
    ['mailbox1', 'abc@gmail.com'],
    ['mailbox2', 'def@yahoo.com'],
    ['mailbox3', 'ghi@msn.com']]
valid_domains = ['msn.com', 'gmail.com']
df1 = pd.DataFrame(mailbox_list, columns=['Mailbox', 'EmailID'])
df2 = pd.DataFrame(valid_domains, columns=['approved_domain'])

valid_list = []
for index, row in df1.iterrows():
    for idx, record in df2.iterrows():
        # Escape the domain so '.' is matched literally, and anchor it at the end of the address
        if re.search(rf"@{re.escape(record['approved_domain'])}$", row['EmailID'], re.IGNORECASE):
            valid_list.append([row['Mailbox'], row['EmailID']])
            break  # stop at the first matching domain so a row is never added twice
df3 = pd.DataFrame(valid_list, columns=['Mailbox', 'EmailID'])
print(df3)
The output of this is:
    Mailbox        EmailID
0  mailbox1  abc@gmail.com
1  mailbox3    ghi@msn.com
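The same regular-expression idea can probably be expressed without explicit loops by joining the approved domains into one pattern and passing it to Series.str.contains. This is a sketch under the same sample data; the names pattern and mask are illustrative only, and the end-of-string anchor is an added assumption about how strict the match should be.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    'Mailbox': ['mailbox1', 'mailbox2', 'mailbox3'],
    'EmailID': ['abc@gmail.com', 'def@yahoo.com', 'ghi@msn.com'],
})
valid_domains = ['msn.com', 'gmail.com']

# Build a single alternation from the approved domains.
# re.escape makes '.' literal; '$' requires the domain to end the address.
pattern = '@(?:' + '|'.join(re.escape(d) for d in valid_domains) + ')$'

# Boolean mask: True where the address ends with '@<approved domain>'
mask = df1['EmailID'].str.contains(pattern, case=False, regex=True)
df3 = df1[mask].reset_index(drop=True)
print(df3)
```

The non-capturing group (?:...) is used so that pandas does not warn about capture groups in the pattern.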
df1 = {'MailBox': ['mailbox1', 'mailbox2', 'mailbox3'], 'Email_Id': ['abc@gmail.com', 'def@yahoo.com', 'ghi@msn.com']}
df2 = {'approved_domain': ['msn.com', 'gmail.com']}
mailboxes, emails = zip(  # unzip the rows back into columns
    *filter(  # keep only rows whose email contains an approved domain
        lambda i: any(  # i = ('mailbox1', 'abc@gmail.com')
            approved_domain in i[1] for approved_domain in df2['approved_domain']
        ),
        zip(df1['MailBox'], df1['Email_Id'])  # zip the columns into rows
    )
)
df3 = {
    'MailBox': mailboxes,
    'Email_Id': emails
}
print(df3)
> {'MailBox': ('mailbox1', 'mailbox3'), 'Email_Id': ('abc@gmail.com', 'ghi@msn.com')}
A big chunk of this code is just for parsing the data structure. The zipping and unzipping is only there to convert the list of columns to a list of rows and back. If you already have a list of rows, you only have to do the filtering part.
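For the case where the data is already a list of rows, the filtering part alone might look like the sketch below. The names rows, approved_domains and filtered are illustrative. Note that it uses an endswith check on '@' + domain, which is stricter than the plain substring test in the answer above.

```python
# Rows as (mailbox, email) tuples, as they would look after zipping the columns
rows = [
    ('mailbox1', 'abc@gmail.com'),
    ('mailbox2', 'def@yahoo.com'),
    ('mailbox3', 'ghi@msn.com'),
]
approved_domains = ['msn.com', 'gmail.com']

# Keep a row if its address ends with '@<domain>' for any approved domain
filtered = [
    (mailbox, email)
    for mailbox, email in rows
    if any(email.lower().endswith('@' + d.lower()) for d in approved_domains)
]
print(filtered)
```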