
Extract features from text data in Python

I have a pandas DataFrame like this:

ID            email
1            abc@google.com
2            abc@facebook.com
3            abc@GOOGLE.COM
4            abc@tesla.com
5            abc@hilton.com
6            abc@FaceBook.com

I want to derive the company from each email address (the part after the @). Sample output:

ID            email                WorkGoogle     WorkFacebook    etc.....
1            abc@google.com          Yes             No              ..
2            abc@facebook.com        No              Yes             .. 
3            abc@GOOGLE.COM          Yes             No              ..
4            abc@tesla.com           No              No              ..
5            abc@hilton.com          No              No              ..
6            abc@FaceBook.com        No              Yes             ..

The matching needs to be case-insensitive, so GOOGLE.COM and google.com count as the same company.

workplace = df.email.rename("workplace").apply(lambda x: x.split("@")[1].lower())
pd.concat([df,
           pd.DataFrame(workplace.apply(lambda x: {x: "Yes"}).to_list(), index=df.index)],
          axis=1).fillna("No")

#    ID             email google.com facebook.com tesla.com hilton.com
# 0   1    abc@google.com        Yes           No        No         No
# 1   2  abc@facebook.com         No          Yes        No         No
# 2   3    abc@GOOGLE.COM        Yes           No        No         No
# 3   4     abc@tesla.com         No           No       Yes         No
# 4   5    abc@hilton.com         No           No        No        Yes
# 5   6  abc@FaceBook.com         No          Yes        No         No
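
The same one-column-per-domain table can probably also be built with pd.get_dummies. The following is only a minimal sketch, assuming the sample data from the question; the variable names domain, flags and result are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "email": ["abc@google.com", "abc@facebook.com", "abc@GOOGLE.COM",
              "abc@tesla.com", "abc@hilton.com", "abc@FaceBook.com"],
})

# Lowercase first, then keep everything after the "@"
domain = df["email"].str.lower().str.split("@").str[1]

# One indicator column per distinct domain (columns come out sorted),
# then turn the indicators into Yes/No strings column by column
dummies = pd.get_dummies(domain).astype(bool)
flags = dummies.apply(lambda col: col.map({True: "Yes", False: "No"}))

result = pd.concat([df, flags], axis=1)
print(result)
```

Unlike the dict-based version, get_dummies orders the new columns alphabetically rather than by first appearance.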

Alternatively, you could add a single column instead of one column per domain:

df["workplace"] = df.email.rename("workplace").str.lower().str.split("@").str[1]
# Then you could do
df.groupby("workplace").agg(list)
#                   ID                                 email
# workplace                                                 
# facebook.com  [2, 6]  [abc@facebook.com, abc@FaceBook.com]
# google.com    [1, 3]      [abc@google.com, abc@GOOGLE.COM]
# hilton.com       [5]                      [abc@hilton.com]
# tesla.com        [4]                       [abc@tesla.com]

Here is a dynamic way to do it without looping, using pivot_table:

df['domain'] = df['email'].str.split('@').str[1].str.split('.').str[0].str.lower()
df = df.pivot_table(index=['ID','email'], columns='domain',aggfunc=lambda x: 'Yes' if len(x)> 0 else 'No', fill_value='No')

Output:

domain              facebook google hilton tesla
ID email                                        
1  abc@google.com         No    Yes     No    No
2  abc@facebook.com      Yes     No     No    No
3  abc@GOOGLE.COM         No    Yes     No    No
4  abc@tesla.com          No     No     No   Yes
5  abc@hilton.com         No     No    Yes    No
6  abc@FaceBook.com      Yes     No     No    No
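
A similar table can also be sketched with pd.crosstab, which counts occurrences directly and so needs no aggregation function. This is an alternative written under the assumption that the data matches the question's sample; the names domain, counts and table are illustrative only.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "email": ["abc@google.com", "abc@facebook.com", "abc@GOOGLE.COM",
              "abc@tesla.com", "abc@hilton.com", "abc@FaceBook.com"],
})

# Company name: text between "@" and the first following ".", lowercased
domain = (
    df["email"].str.lower()
    .str.split("@").str[1]
    .str.split(".").str[0]
    .rename("domain")
)

# Count how often each (ID, email) pair occurs with each domain
counts = pd.crosstab([df["ID"], df["email"]], domain)

# Any count above zero becomes "Yes", everything else "No"
table = counts.gt(0).apply(lambda col: col.map({True: "Yes", False: "No"}))
print(table)
```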

Assuming df is your dataframe, please try this:

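import numpy as np

# Lowercase first so GOOGLE.COM and google.com map to the same company,
# then keep the text between "@" and the first following "."
df['workplace'] = df['email'].str.lower().str.split('@', n=1).str[1].str.split('.', n=1).str[0]
for workplace in df['workplace'].unique():
    # e.g. 'google' -> column 'WorkGoogle'
    df['Work' + workplace.capitalize()] = np.where(df['workplace'] == workplace, 'Yes', 'No')
# The helper column is no longer needed
df = df.drop(columns=['workplace'])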

If you are OK with manually creating a new column for each domain, you can use this:

df['WorkGoogle'] = df['email'].str.lower().str.contains('google')
df['WorkFacebook'] = df['email'].str.lower().str.contains('facebook')
# etc etc

That will give you True/False rather than Yes/No. If you want Yes/No, you can map the values:

df['WorkGoogle'] = df['email'].str.lower().str.contains('google').map({True:'Yes',False:'No'})
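
Note that str.contains checks the whole address, so a company name appearing before the @ would also match. One way to restrict the check to the domain is to extract it first with a regular expression. The sketch below uses a made-up address (google.fan@tesla.com) purely to illustrate the difference, and the names emails, company, work_google and naive are illustrative. It takes the text between the @ and the next dot, so an address on a subdomain such as mail.google.com would yield "mail" and need extra handling.

```python
import pandas as pd

# The first two addresses are from the question; the last one is hypothetical
emails = pd.Series(["abc@google.com", "abc@GOOGLE.COM", "google.fan@tesla.com"])

# Capture the text between "@" and the next "." after lowercasing
company = emails.str.lower().str.extract(r"@([^.@]+)\.", expand=False)

# Compare against the extracted company only, then map to Yes/No
work_google = company.eq("google").map({True: "Yes", False: "No"})

# For comparison: substring search over the full address
naive = emails.str.lower().str.contains("google")

print(pd.DataFrame({"email": emails, "company": company,
                    "WorkGoogle": work_google, "naive": naive}))
```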

FYI: this solution is not efficient performance-wise. The comments on this answer may point to a more efficient one.

I would first build the set of all companies:

companies = set([email.split('@')[1].split('.')[0].lower() for email in df['email']])

Then simply iterate over it:

for company in companies:
    df['Work'+company.capitalize()] = df['email'].apply(lambda x: x.split("@")[1].lower()).str.contains(company)
